AMD RYZEN™ PROCESSOR SOFTWARE OPTIMIZATION
PRESENTED BY KEN MITCHELL
| GDC19 | AMD Ryzen™ Processor Software Optimization | March 20th, 2019
ABSTRACT
• Join AMD ISV Game Engineering team members for an introduction to the AMD Ryzen™ family of processors, followed by advanced optimization topics. Learn about the Ryzen™ lineup of processors, and about profiling tools and techniques to understand optimization opportunities, and get a glimpse of the next generation of “Zen 2” x86 core architecture. Gain insight into code optimization opportunities and lessons learned, with examples including C/C++, assembly, and hardware performance-monitoring counters.
• Ken Mitchell is a Senior Member of Technical Staff in the Radeon™ Technologies Group/AMD ISV Game Engineering team, where he focuses on helping game developers utilize AMD CPU cores efficiently. His previous work includes automating & analyzing PC applications for performance projections of future AMD products, as well as developing benchmarks. Ken studied computer science at the University of Texas at Austin.
AGENDA
• Success Stories
• “Zen” Family Processors
• AMD μProf Profiler
• Optimizations & Lessons Learned
• Roadmap
• Questions
• Giveaway
SUCCESS STORIES
[Chart: DOTA 2, version 3359, gameplay 7.21B. Average FPS for DX9, DX11, and VK Beta (higher is better). DX11 +9%, VK Beta +12% relative to DX9.]
[Chart: Audacity LAME MP3 Encode, x86. Elapsed Time (s) for v3.99 non-SSE2 vs v3.100 SSE2 (lower is better). +107%.]
SUCCESS STORIES
[Chart: World of Warcraft v8.1, DX12. Average FPS with gxMT Off vs gxMT On (higher is better). +36%.]
See disclaimer and testing details on the next slide.
DISCLAIMER
• World of Warcraft
  • https://community.amd.com/community/gaming/blog/2019/02/08/ryzen-processors-are-now-optimized-for-both-alliance-and-horde-in-world-of-warcraft
  • Testing done by AMD performance labs, January 4, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200, GeForce GTX 1080 (driver 398.82), MSI B450 GAMING PLUS Socket AM4 motherboard, Samsung 850 SSD, Windows 10 x64 Pro (RS4).
• DOTA 2
  • Testing done by Kenneth Mitchell, February 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, Radeon™ R7 SSD, 1920x1080 resolution, Best Looking, DOTA 2 version 3359, gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
• Audacity LAME MP3 Encode
  • https://www.pcper.com/reviews/Processors/Ryzen-7-2700X-and-Ryzen-5-2600X-Review-Zen-Matures/Media-Encoding-and-Rendering
  • AMD Ryzen™ 7 2700X Processor, 16GB Corsair Vengeance DDR4-3200 at 2933, NVIDIA GeForce GTX 1080Ti 11GB (driver 390.77), ASUS Crosshair VII Hero motherboard, Corsair Neutron XTi 480 SSD, Windows 10 Pro x64 RS3, fully updated as of 412017.
“ZEN” FAMILY PROCESSORS
DATA FLOW
“PINNACLE RIDGE” 8 CORE PROCESSOR
[Block diagram: data flow between the core caches, Data Fabric, Unified Memory Controller, DRAM Channel, and IO Hub Controller.]
• L1 I-Cache: 64K, 4-way; 32B fetch
• L1 D-Cache: 32K, 8-way; 2×16B load, 1×16B store
• L2: 512K I+D Cache, 8-way
• L3: 8M I+D Cache, 16-way
• Links between blocks are labeled 32B/cycle or 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 4.3 GHz | 3.7 GHz; fclk = uclk = memclk = 1.47 GHz (DDR4-2933); lclk 615 MHz
“RAVEN RIDGE” 4 CORE PROCESSOR
[Block diagram: data flow between the core caches, Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, and Media blocks.]
• L1 I-Cache: 64K, 4-way; 32B fetch
• L1 D-Cache: 32K, 8-way; 2×16B load, 1×16B store
• L2: 512K I+D Cache, 8-way
• L3: 4M I+D Cache, 16-way
• Links between blocks are labeled 32B/cycle or 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 3.9 GHz | 3.6 GHz; fclk = uclk = memclk = 1.47 GHz (DDR4-2933); lclk 496 MHz; sclk 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Block diagram: data flow between the core caches, Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, and GMI.]
• L1 I-Cache: 64K, 4-way; 32B fetch
• L1 D-Cache: 32K, 8-way; 2×16B load, 1×16B store
• L2: 512K I+D Cache, 8-way
• L3: 8M I+D Cache, 16-way
• On-die links are labeled 32B/cycle or 16B/cycle; GMI is 16B/cycle off die
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 4.4 GHz | 3.5 GHz; fclk = uclk = memclk = 1.47 GHz (DDR4-2933); lclk 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA or UMA
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by restricting the system to die0 processors while in NUMA mode, which effectively improves memory latency
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
  • RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Block diagram: Die 0 and Die 1, each with two CCXs plus DDR and IO blocks; DDR4 Channels A, B, C, and D; four groups of 16 PCIe® Gen 3 lanes; ∞ links between the blocks.]
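The near and far figures on this slide hint at why keeping threads and their memory on one die can help. The sketch below is a back-of-envelope model only: it assumes expected DRAM latency is a simple linear blend of the ~64ns and ~105ns values quoted above. The blend model and the function name are illustrative assumptions, not an AMD formula.

```cpp
// Back-of-envelope model (an assumption for illustration): expected DRAM
// latency as a linear blend of the near and far figures from the slide.
constexpr double kNearNs = 64.0;   // ~64ns, local-die DRAM (from the slide)
constexpr double kFarNs  = 105.0;  // ~105ns, other-die DRAM (from the slide)

// remote_fraction: share of DRAM accesses served by the other die, in [0, 1].
constexpr double expected_latency_ns(double remote_fraction) {
    return (1.0 - remote_fraction) * kNearNs + remote_fraction * kFarNs;
}

static_assert(expected_latency_ns(0.0) == 64.0, "all accesses local");
static_assert(expected_latency_ns(1.0) == 105.0, "all accesses remote");
```

Under this model, an application whose accesses split evenly between dies would see about 84.5ns on average, which is the kind of gap that restricting a game to die0 is meant to close.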
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA only
    • Windows 10 2019H1 Insider Preview has improved NUMA support
    • The AMD Ryzen™ Master Utility Game Mode may improve performance by restricting the system to die0 processors, which effectively improves memory latency
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
  • RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Block diagram: Die 0, Die 1, Die 2, and Die 3, each with two CCXs; DDR and IO blocks; DDR4 Channels A, B, C, and D; four groups of 16 PCIe® Gen 3 lanes; ∞ links between the dies.]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. Updated Feb 28, 2017. Generational IPC uplift for the “Zen” architecture vs “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”) / 8GB DDR3-2133 (“Excavator”) / 8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs “Piledriver” (31.5 vs 20.7 | +52%), “Zen” vs “Excavator” (31.5 vs 19.2 | +64%). Cinebench R15 1T scores: “Zen” vs “Piledriver” (139 vs 79, both at 3.4G | +76%), “Zen” vs “Excavator” (160 vs 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.

Feature columns, in order (1 = supported, 0 = not supported), grouped in fives:
  1-5:   ADX, CLFLUSHOPT, RDSEED, SHA, SMAP
  6-10:  XGETBV, XSAVEC, XSAVES, AVX2, BMI2
  11-15: MOVBE, RDRND, SMEP, FSGSBASE, XSAVEOPT
  16-20: BMI, FMA, F16C, AES, AVX
  21-25: OSXSAVE, PCLMULQDQ, SSE4.1, SSE4.2, XSAVE
  26-30: SSSE3, CLZERO, FMA4, TBM, XOP

YEAR  FAMILY  MODELS   PRODUCT FAMILY                     EXAMPLE PRODUCT    FEATURES 1-30
2017  17h     00h-0Fh  “Summit Ridge”, “Pinnacle Ridge”   Ryzen™ 7 2700X     11111 11111 11111 11111 11111 11000
2015  15h     60h-6Fh  “Carrizo”, “Bristol Ridge”         A12-9800           00000 00011 11111 11111 11111 10111
2014  15h     30h-3Fh  “Kaveri”, “Godavari”               A10-7890K          00000 00000 00011 11111 11111 10111
2012  15h     00h-0Fh  “Vishera”                          FX-8370            00000 00000 00000 11111 11111 10111
2011  15h     00h-0Fh  “Zambezi”                          FX-8150            00000 00000 00000 00011 11111 10101
2013  16h     00h-0Fh  “Kabini”                           A6-1450            00000 00000 10001 10111 11111 10000
2011  14h     00h-0Fh  “Ontario”                          E-450              00000 00000 00000 00000 00000 10000
2011  12h              “Llano”                            A8-3870            00000 00000 00000 00000 00000 00000
2009  10h              “Greyhound”                        Phenom II X4 955   00000 00000 00000 00000 00000 00000
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for instructions RCPPS, RCPSS, RSQRTPS, RSQRTSS, which define relative error as
  • |Relative Error| <= 1.5 * 2^-12
[Block diagram: floating point unit. Decode/Rename, 192 Entry Retire Queue, 36 Entry Scheduler, 160 entry Physical Register File, Forwarding Muxes, pipes MUL0, ADD0, MUL1, ADD1, Load Convert Unit, Int to FP, FP to Int. 4 fp micro-op dispatch, 8 micro-op retire, 128 bit loads.]
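Because RCPPS, RCPSS, RSQRTPS, and RSQRTSS are only specified to a relative error bound, code that validates their results is safer comparing against that bound than against another machine's exact bits. The following is a minimal sketch of such a check; the helper names are hypothetical and the reference value is assumed to be computed at full precision.

```cpp
#include <cmath>

// Bound quoted on the slide: |Relative Error| <= 1.5 * 2^-12.
const double kMaxRelError = 1.5 * std::ldexp(1.0, -12);

// True when `approx` is within the documented relative error of `exact`.
// `exact` must be non-zero.
bool within_documented_error(double approx, double exact) {
    return std::fabs((approx - exact) / exact) <= kMaxRelError;
}

// Two processors may return different approximations of the same value and
// both still be in spec, so test each against the bound instead of testing
// the two results for equality.
bool both_in_spec(double a, double b, double exact) {
    return within_documented_error(a, exact) &&
           within_documented_error(b, exact);
}
```

A golden-file test recorded on one vendor's processor would use a check like `within_documented_error` when replayed on another, since a bit-exact comparison can fail even though both results are valid.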
CLZERO INSTRUCTION
• AMD vendor specific instruction
  • An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue → Store Commit → Write Combining Buffer → L2 → Data Fabric → Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Block diagram: load/store unit. AGU0 and AGU1, Load Queue, Store Queue, L0 Pick, L1 Pick, L1/L2 TLB, Micro-tags, Page Walker, TLB0/DAT0, TLB1/DAT1, 32K Data Cache, Prefetch, MAB, Store Pipe Pick (STP), Store Commit, WCB to L2; 32 bytes to/from L2; results to Ex and to FP.]
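The last recommendation on this slide amounts to calling the standard library instead of issuing CLZERO yourself. A minimal sketch, with a hypothetical helper name:

```cpp
#include <cstring>
#include <vector>

// Zero a buffer with memset, as the slide recommends, and leave the choice
// of store instructions to the C runtime. The helper name is illustrative.
void zero_buffer(std::vector<unsigned char>& buf) {
    if (!buf.empty()) {
        std::memset(buf.data(), 0, buf.size());
    }
}
```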
CACHE LATENCY
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions

Level  Count  Capacity   Sets  Ways  Line Size  Latency
uop    8      2 K uops   32    8     8 uops     NA
L1I    8      64 KB      256   4     64 B       4 clocks
L1D    8      32 KB      64    8     64 B       4 clocks
L2U    8      512 KB     1024  8     64 B       12 clocks
L3U    2      8 MB       8192  16    64 B       35 clocks

Cache Latency in core clock cycles (less is better)
Processor             L1D  L2U  L3U
AMD Ryzen™ 7 1800X    4    17   40
AMD Ryzen™ 7 2700X    4    12   35
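The Sets column follows from the other columns: sets = capacity / (ways × line size). The sketch below checks each row of the table at compile time, along with the "2 cpu clock cycles" figure, which matches a 64B line moved over the 32B/cycle links shown on the data flow slides. The function and constant names are illustrative.

```cpp
#include <cstdint>

// sets = capacity / (ways * line size)
constexpr std::uint64_t cache_sets(std::uint64_t capacity,
                                   std::uint64_t ways,
                                   std::uint64_t line_size) {
    return capacity / (ways * line_size);
}

constexpr std::uint64_t KiB = 1024, MiB = 1024 * KiB, kLine = 64;

// Rows from the table above.
static_assert(cache_sets(2048, 8, 8) == 32, "uop cache, in uops");
static_assert(cache_sets(64 * KiB, 4, kLine) == 256, "L1I");
static_assert(cache_sets(32 * KiB, 8, kLine) == 64, "L1D");
static_assert(cache_sets(512 * KiB, 8, kLine) == 1024, "L2U");
static_assert(cache_sets(8 * MiB, 16, kLine) == 8192, "L3U");

// Cycles needed to move one line over a link of the given width.
constexpr std::uint64_t cycles_per_line(std::uint64_t line_size,
                                        std::uint64_t bytes_per_cycle) {
    return line_size / bytes_per_cycle;
}

static_assert(cycles_per_line(kLine, 32) == 2, "64B line at 32B/cycle");
```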
REFILL FROM LOCAL DRAM
• ~64ns near memory
  • RP2-22: see the full footnote on the RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR slide
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Block diagram: two CCXs, each behind a CCM; two CSs, each in front of a UMC and a DDR4 channel; SDF Transport Layer; CAKE to IFIS or IFOP (off-chip).]
REFILL FROM OTHER LOCAL CCX
• Refill from other CCX cost may be similar to memory latency
• Abbreviations (CCX, CCM, SDF, CAKE, IFIS, IFOP, CS, UMC) are as defined on the REFILL FROM LOCAL DRAM slide
[Block diagram: two CCXs, each behind a CCM; two CSs, each in front of a UMC and a DDR4 channel; SDF Transport Layer; CAKE to IFIS or IFOP (off-chip).]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory
  • RP2-22: see the full footnote on the RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR slide
[Block diagram: two dies, each with CCX, CCM, CS, UMC, DDR4, and SDF Transport Layer blocks, joined through CAKE and IFIS or IFOP off-chip links.]
AMD μPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Supports multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable, Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.

Threads  Time (s)
0        23
1        29
2        6
3        3
4        4
5        4
6        4
7        3
8        2
9        2
10       1
11       1
12       1
13       1
14       1
15       5
16       11
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
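Several of the metrics listed on this slide are ratios of raw counters. As a rough sketch, IPC is retired instructions divided by CPU clocks, and "PTI" is read here as events per thousand retired instructions. That reading and the helper names are assumptions for illustration; the slide does not expand the abbreviation.

```cpp
#include <cstdint>

// Instructions per clock from two raw counter values.
double ipc(std::uint64_t retired_instructions, std::uint64_t cpu_clocks) {
    return static_cast<double>(retired_instructions) /
           static_cast<double>(cpu_clocks);
}

// Events per thousand retired instructions (the assumed meaning of PTI).
double per_thousand_instructions(std::uint64_t events,
                                 std::uint64_t retired_instructions) {
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}
```

Normalizing event counts this way lets two functions that retire very different numbers of instructions be compared directly. For example, 50 refills over 10,000 retired instructions is 5 PTI regardless of how long the sample ran.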
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L2
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access paths shown: rdpmc, SMN in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled for the x64 platform shows higher performance
  • The ReleaseSSE2 configuration uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes

LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) (less is better)
Configuration, Platform  seconds  uplift
Release, x86             52
ReleaseSSE2, x86         44       +18%
Release, x64             43       +21%
ReleaseSSE2, x64         39       +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

Debugger output when an unsupported instruction is executed:
    0:000> g
    (79c838) Illegal instruction - code c000001d (first chance)
    (79c838) Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4  pabsd xmm2,xmm4
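The advice to test CPUID before calling instructions could be sketched as follows. This is a minimal illustration, not code from the presentation: the helper names `testBit`, `queryCpuid`, `cpuHasSsse3` and `cpuHasFma4` are hypothetical, and the sketch assumes the architectural bit positions CPUID Fn0000_0001 ECX[9] for SSSE3 and CPUID Fn8000_0001 ECX[16] for FMA4. On non-x86 builds the query simply reports "not supported".

```cpp
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   // __cpuid (MSVC)
#define HAVE_CPUID_MSVC 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#define HAVE_CPUID_GNU 1
#endif

// Hypothetical helper: true if bit `bit` (0..31) is set in `reg`.
inline bool testBit(std::uint32_t reg, unsigned bit) {
    return ((reg >> bit) & 1u) != 0;
}

// Hypothetical helper: reads EAX, EBX, ECX, EDX for one CPUID leaf.
// Returns false if the leaf is not supported or this is not an x86 build.
inline bool queryCpuid(std::uint32_t leaf, std::uint32_t regs[4]) {
    regs[0] = regs[1] = regs[2] = regs[3] = 0;
#if defined(HAVE_CPUID_MSVC)
    int r[4];
    // Leaf 0 (or 0x80000000) returns the highest supported leaf in EAX.
    __cpuid(r, static_cast<int>(leaf & 0x80000000u));
    if (static_cast<std::uint32_t>(r[0]) < leaf) return false;
    __cpuid(r, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<std::uint32_t>(r[i]);
    return true;
#elif defined(HAVE_CPUID_GNU)
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(leaf, &a, &b, &c, &d)) return false;
    regs[0] = a; regs[1] = b; regs[2] = c; regs[3] = d;
    return true;
#else
    (void)leaf;
    return false;
#endif
}

// SSSE3: CPUID Fn0000_0001, ECX bit 9.
inline bool cpuHasSsse3() {
    std::uint32_t r[4];
    return queryCpuid(1u, r) && testBit(r[2], 9);
}

// FMA4: CPUID Fn8000_0001, ECX bit 16.
inline bool cpuHasFma4() {
    std::uint32_t r[4];
    return queryCpuid(0x80000001u, r) && testBit(r[2], 16);
}
```

A caller would select a code path once at startup, for example taking an SSSE3 path only when `cpuHasSsse3()` returns true and an SSE2 path otherwise. AVX-family features additionally need an operating system state check through XGETBV, which this sketch leaves out.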
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            }
            else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • Number of cores per processor & Number of threads per processor return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing of affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions.
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                       // logical processors
    DWORD_PTR p = 0xffffffffffffffff;     // process mask
    int b = 32;                           // bit index

    #if 0 // BAD
    int x = p;                            // signed narrowing
    int count = 0;
    // while (x != 0) {                   // never zero
    while (x > 0) {                       // false
        count += (x & 1);
        x >>= 1;                          // fill bits with sign
    }
    printf("popcnt = %i\n", count);                        // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));       // undefined; 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));    // 0x0000000100000000
    #endif
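The same rule can be written without Windows types. The sketch below is illustrative only; `countProcessors` and `processorBit` are hypothetical helper names. Keeping the mask in an unsigned 64-bit type means the right shift fills with zeros and the left shift is performed at 64-bit width.

```cpp
#include <cstdint>

// Counts the set bits in a 64-bit affinity mask. Because the mask is
// unsigned, each right shift fills with zeros and the loop terminates.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Builds the mask for logical processor `index` (0..63).
// The literal is 64 bits wide, so shifting by 32 or more is well defined.
inline std::uint64_t processorBit(unsigned index) {
    return std::uint64_t{1} << index;
}
```

With a full 64-processor group, `countProcessors(0xffffffffffffffff)` yields 64, whereas the signed 32-bit version on the slide reports 0.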
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];  /* bad assumption */
    char* buffer = NULL;
    DWORD len = 0;
    // First call with no buffer: fails and reports the required length in len.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            // Second call with a buffer of the reported size.
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or an insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: cpu time % difference from NumContexts 1. Series: Draws 1025, Draws 10250.
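The recommendation NumContexts = min(cores-1, Draws/250) can be expressed as a small helper. This is a sketch under stated assumptions: the function name `recommendedNumContexts` is hypothetical, and clamping the result to at least one context is an addition of this example rather than something the slide specifies.

```cpp
#include <algorithm>

// Hypothetical helper for the rule of thumb
//   NumContexts = min(cores - 1, draws / 250)
// Assumption of this sketch: never return fewer than 1 context.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = (physicalCores > 1) ? physicalCores - 1 : 1;
    const unsigned byDraws = draws / 250;   // integer division
    return std::max(1u, std::min(byCores, byDraws));
}
```

For the 8-core test system this gives 4 contexts at 1025 draws and 7 contexts at 10250 draws. As with the thread count guidance, profiling the actual workload should decide the final value.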
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithreadenable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • Avoid lock prefix instructions
  • Use the pause instruction
  • Test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K per thousand instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {   // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
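For code bases that prefer standard C++ over MSVC intrinsics, the same test and test-and-set pattern could be sketched with `std::atomic`. This is an illustrative variant, not the presenter's code: the `SpinLock` type and the `SPIN_PAUSE` macro are hypothetical names, and releasing the lock with a plain release store (instead of `_InterlockedExchange`) is a choice made for this sketch.

```cpp
#include <atomic>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SPIN_PAUSE() _mm_pause()
#elif defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() ((void)0)   // no pause hint on other architectures
#endif

// The lock occupies its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};   // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load (no locked instruction) while taken.
            while (state.load(std::memory_order_relaxed) != 0) {
                SPIN_PAUSE();
            }
            // Test-and-set: one atomic exchange once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
        }
    }

    // Single attempt; returns true if the lock was acquired.
    bool try_lock() {
        return state.load(std::memory_order_relaxed) == 0 &&
               state.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() {
        state.store(0, std::memory_order_release);
    }
};

static_assert(alignof(SpinLock) == 64, "lock should be cache-line aligned");
```

Because `SpinLock` provides `lock()` and `unlock()`, it can be used with `std::lock_guard<SpinLock>`. The instructions actually emitted depend on the compiler and flags, so the disassembly is worth checking.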
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Use Best Practices With Spinlocks, elapsed time (less is better)
binary | Time (s)
Before | 304
After | 27
Improvement: +1018%
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

DISASSEMBLY SNIPPET
BEFORE (BAD)
    00000001400010C1:
        xor          eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp          eax,ecx
        je           00000001400010C1

AFTER (GOOD)
    00000001400010C0:
        cmp          dword ptr [?gLock@@3IA],1
        je           00000001400010D9
        mov          eax,1
        xchg         eax,dword ptr [?gLock@@3IA]
        cmp          eax,1
        jne          00000001400010DD
    00000001400010D9:
        pause
        jmp          00000001400010C0
    00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS
Intrinsic | Generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg

Preferred: instructions without the lock prefix (xchg).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently, “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder.

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s) (0 to 35). Series: before, after.
CODE SAMPLE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Times `steps` calls of memcpy with a size only known at run time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Fill every element of a; the index must stay below sizeof(a).
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 per thousand instructions is bad for top functions
      • >= 35 per thousand instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid Too Many Non-Temporal Streams, elapsed time (less is better)
binary | Time (s)
Before | 104
After | 47
Improvement: +123%
CODE SAMPLE
    #include <intrin.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);   // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);   // second non-temporal stream
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);    // regular store
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps  xmmword ptr [rsi+r8*4+4640h],xmm0

AFTER (WITH WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups   xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread-local rather than global or process-shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
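The "align and pad" advice can be checked at compile time. The sketch below is illustrative: `PaddedCounter`, `kCacheLine` and `onDifferentLines` are hypothetical names, and the 64-byte line size is an assumption that matches the 64B cache lines described in this presentation.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Per-thread data aligned and padded to one full cache line, so that
// neighbouring array elements never share a line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long sum = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each element should fill exactly one cache line");
static_assert(alignof(PaddedCounter) == kCacheLine,
              "each element should start on a cache-line boundary");

// Returns true if the two addresses fall in different cache lines.
inline bool onDifferentLines(const void* p, const void* q) {
    const auto lineOfP = reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
    const auto lineOfQ = reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
    return lineOfP != lineOfQ;
}
```

With an array such as `PaddedCounter counters[4]`, each thread can update `counters[i].sum` without touching a line owned by another thread. The cost is 64 bytes per element instead of the size of `unsigned long`.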
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid False Sharing, elapsed time
binary | Time (s)
Before | 365
After | 47
Improvement: +679%
CODE SAMPLE
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;   // each thread writes only its own element
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms) %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX.
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI.

PROFILING AFTER
Few refills from CCX.

DISASSEMBLY SNIPPET
BEFORE (BAD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180

AFTER (GOOD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New “Zen 2” core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
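The diff column above appears to be the relative speedup in percent, computed from the raw before and after columns rather than the rounded ones (the size 3 row gives 23, which 30/24 would not). A minimal sketch of that arithmetic follows; the function name `speedup_percent` is hypothetical and the rounding rule is an assumption that happens to match the rows checked.

```c
#include <assert.h>

/* Percentage by which "before" exceeds "after", rounded to the nearest
   integer. Positive means the "after" build took less time.
   Hypothetical helper; the rounding rule is an assumption. */
static long speedup_percent(double before, double after) {
    assert(after > 0.0);
    double pct = (before / after - 1.0) * 100.0;
    /* Round half away from zero without pulling in <math.h>. */
    return (long)(pct >= 0.0 ? pct + 0.5 : pct - 0.5);
}
```

With the raw values from the table, sizes 1, 33 and 129 reproduce the listed 22, 675 and -13.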
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | | 18 | 21 | 33 | |||||
label | | +18% | +21% | +33% | |||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before (s) | after (s) | binary | seconds | difference | ||||||
38.78270 | 4.68667 | Before | 36.5 | |||||||
35.79369 | 4.68613 | After | 4.7 | +679% | ||||||
36.17797 | 4.68606 | |||||||||
35.85746 | 4.68740 | |||||||||
40.53311 | 4.68798 | |||||||||
36.20973 | 4.68666 | |||||||||
35.12279 | 4.68835 | |||||||||
36.15476 | 4.68786 | |||||||||
35.07937 | 4.68740 | |||||||||
35.62341 | 4.68730 |
before (s) | after (s) | binary | seconds | difference | ||||||
10.07610284 | 4.76582356 | Before | 10.4 | |||||||
10.24247928 | 4.48045378 | After | 4.7 | +123% | ||||||
10.52145299 | 4.66659369 | |||||||||
10.56735958 | 4.64907149 | |||||||||
10.24368829 | 4.74053478 | |||||||||
10.66633156 | 4.61674963 | |||||||||
10.58203029 | 4.62514829 | |||||||||
10.41118097 | 4.9289895 | |||||||||
10.55392649 | 4.75898095 | |||||||||
10.51172582 | 4.61760799 |
before (s) | after (s) | binary | seconds | difference | ||||||
36.08651906 | 2.67431402 | Before | 30.4 | |||||||
22.42053937 | 2.73707055 | After | 2.7 | +1018% | ||||||
30.42886884 | 2.71746015 | |||||||||
26.09372553 | 2.69953529 | |||||||||
32.37432947 | 2.69730537 | |||||||||
29.91803304 | 2.725437 | |||||||||
36.65558054 | 2.73832348 | |||||||||
30.77696925 | 2.72580411 | |||||||||
23.74288763 | 2.71854875 | |||||||||
35.04976333 | 2.72329958 |
Cache latency (core clock cycles) | L1D | L2U | L3U | ||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff | |||
gxMT Off | 42.68 | ||||
gxMT On | 58.19 | +36% | |||
Ryzen 7 2700X + GTX 1080 (graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff | |||
DX9 | 53.8500972007 | ||||
DX11 | 58.9473747244 | +9% | |||
VK Beta | 60.1682313467 | +12% | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 1911 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA 2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34 | |||||
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff | |||
v3.99 non-SSE2 | 170 | ||||
v3.100 SSE2 | 82 | +107% | |||
Ryzen 7 2700X |
DISCLAIMER
• World of Warcraft
  • https://community.amd.com/community/gaming/blog/2019/02/08/ryzen-processors-are-now-optimized-for-both-alliance-and-horde-in-world-of-warcraft
  • Testing done by AMD performance labs January 4, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200, GeForce GTX 1080 (driver 398.82), MSI B450 GAMING PLUS Socket AM4 motherboard, Samsung 850 SSD, Windows 10 x64 Pro (RS4)
• DOTA 2
  • Testing done by Kenneth Mitchell February 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 1911 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, Radeon™ R7 SSD, 1920x1080 resolution, Best Looking, DOTA 2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34
• Audacity LAME MP3 Encode
  • https://www.pcper.com/reviews/Processors/Ryzen-7-2700X-and-Ryzen-5-2600X-Review-Zen-Matures/Media-Encoding-and-Rendering
  • AMD Ryzen™ 7 2700X Processor, 16GB Corsair Vengeance DDR4-3200 at 2933, NVIDIA GeForce GTX 1080Ti 11GB (driver 390.77), ASUS Crosshair VII Hero motherboard, Corsair Neutron XTi 480 SSD, Windows 10 Pro x64 RS3 fully updated as of 412017
“ZEN” FAMILY PROCESSORS
DATA FLOW
“PINNACLE RIDGE” 8 CORE PROCESSOR
[Data flow diagram; recoverable labels:]
• 32K D-Cache, 8-way; 2x16B load, 1x16B store
• 64K I-Cache, 4-way; 32B fetch
• 512K L2 I+D Cache, 8-way
• 8M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 4.3 GHz / 3.7 GHz; fclk = uclk = memclk 1.47 GHz (DDR4-2933); lclk 615 MHz
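The link widths and clocks on this slide combine into peak bandwidth figures by simple multiplication. The sketch below shows that arithmetic only; it reads the slide's `cclk 43 GHz` and `147 GHz` as 4.3 GHz and 1.47 GHz (decimal points lost in extraction), and the pairing of a given link width with a given clock domain is an assumption, since the diagram itself did not survive. The name `peak_gb_per_s` is hypothetical.

```c
#include <assert.h>

/* Peak link bandwidth in GB/s: bytes moved per clock times clock in GHz
   (1 byte/cycle at 1 GHz is 1 GB/s, with GB = 10^9 bytes).
   Hypothetical helper for illustration only. */
static double peak_gb_per_s(double bytes_per_cycle, double clock_ghz) {
    assert(bytes_per_cycle > 0.0 && clock_ghz > 0.0);
    return bytes_per_cycle * clock_ghz;
}
```

Under those assumptions, a 32B/cycle link at a 4.3 GHz core clock peaks near 137.6 GB/s, and a 16B/cycle link at a 1.47 GHz memclk peaks near 23.5 GB/s, which is in line with one DDR4-2933 channel (8 bytes x 2933 MT/s).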
“RAVEN RIDGE” 4 CORE PROCESSOR
[Data flow diagram; recoverable labels:]
• 32K D-Cache, 8-way; 2x16B load, 1x16B store
• 64K I-Cache, 4-way; 32B fetch
• 512K L2 I+D Cache, 8-way
• 4M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, Media
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 3.9 GHz / 3.6 GHz; fclk = uclk = memclk 1.47 GHz (DDR4-2933); lclk 496 MHz; sclk 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Data flow diagram; recoverable labels:]
• 32K D-Cache, 8-way; 2x16B load, 1x16B store
• 64K I-Cache, 4-way; 32B fetch
• 512K L2 I+D Cache, 8-way
• 8M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GMI
• Link widths shown: 32B/cycle and 16B/cycle; GMI 16B/cycle off die
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 4.4 GHz / 3.5 GHz; fclk = uclk = memclk 1.47 GHz (DDR4-2933); lclk 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA or UMA
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
  • RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Package diagram: Die 0 and Die 1; four CCX blocks; DDR and IO blocks; DDR4 Channels A, B, C, D; four groups of 16 PCIe® Gen 3 Lanes; ∞ link markers.]
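The near and far latencies quoted on this slide suggest why restricting work to one die can help. As a rough sketch, assuming average DRAM latency is a simple weighted mix of the ~64ns and ~105ns figures (a simplifying model, not something the slides state), the effect of the far-memory fraction can be estimated as below. The name `expected_latency_ns` is hypothetical.

```c
#include <assert.h>

/* Average DRAM latency when a fraction of accesses go to the other die's
   memory pool. Linear mixing is an assumption for illustration; it ignores
   queuing, bandwidth limits and cache effects. */
static double expected_latency_ns(double near_ns, double far_ns,
                                  double far_fraction) {
    assert(far_fraction >= 0.0 && far_fraction <= 1.0);
    return (1.0 - far_fraction) * near_ns + far_fraction * far_ns;
}
```

With the slide's numbers, an even split between near and far memory averages 84.5ns, while keeping every access local stays at 64ns.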
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA only
  • Windows 10 2019H1 Insider Preview has improved NUMA support
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
  • RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Package diagram: Die 0, Die 1, Die 2, Die 3; eight CCX blocks; two DDR and two IO blocks; DDR4 Channels A, B, C, D; four groups of 16 PCIe® Gen 3 Lanes; ∞ link markers.]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC than previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”) / 8GB DDR3-2133 (“Excavator”) / 8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”) / 8GB DDR3-2133 (“Excavator”) / 8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs. 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs. 19.2 | +64%). Cinebench R15 1T scores: “Zen” vs. “Piledriver” (139 vs. 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs. 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE4.1 | SSE4.2 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | “Llano” | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
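Because FMA4, TBM and XOP are present on Family 15h but not on Family 17h, and CLZERO is the reverse, code paths that use them are usually gated on CPUID feature bits rather than on the processor name. The following is a minimal sketch, not something shown in the slides: the struct and function names are hypothetical, the bit positions are the ones documented for CPUID Fn8000_0001 ECX and Fn8000_0008 EBX, and the query path assumes a GCC or Clang toolchain with `<cpuid.h>` (on other targets it reports all four features as absent).

```c
#include <assert.h>
#include <stdint.h>
#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_CPUID_H 1
#else
#define HAVE_CPUID_H 0
#endif

/* Hypothetical container for the four features discussed on the slide. */
typedef struct {
    int clzero, fma4, tbm, xop;
} amd_ext_features;

/* Decode raw register values:
   CPUID Fn8000_0001 ECX: bit 11 = XOP, bit 16 = FMA4, bit 21 = TBM.
   CPUID Fn8000_0008 EBX: bit 0 = CLZERO. */
static amd_ext_features decode_ext_features(uint32_t fn80000001_ecx,
                                            uint32_t fn80000008_ebx) {
    amd_ext_features f;
    f.xop    = (int)((fn80000001_ecx >> 11) & 1u);
    f.fma4   = (int)((fn80000001_ecx >> 16) & 1u);
    f.tbm    = (int)((fn80000001_ecx >> 21) & 1u);
    f.clzero = (int)(fn80000008_ebx & 1u);
    return f;
}

/* Query the running CPU; registers stay zero when CPUID is unavailable. */
static amd_ext_features query_ext_features(void) {
    uint32_t ecx1 = 0, ebx8 = 0;
#if HAVE_CPUID_H
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(0x80000001u, &a, &b, &c, &d)) ecx1 = c;
    if (__get_cpuid(0x80000008u, &a, &b, &c, &d)) ebx8 = b;
#endif
    return decode_ext_features(ecx1, ebx8);
}
```

A caller would pick an FMA4 or XOP code path only when the corresponding field is set, and fall back to a common path otherwise.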
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for instructions RCPPS, RCPSS, RSQRTPS, RSQRTSS, which define relative error as
  • |Relative Error| <= 1.5 * 2^-12
[Floating point unit diagram; recoverable labels: Decode/Rename; 192 Entry Retire Queue; 36 Entry Scheduler; 160 entry Physical Register File; MUL0, ADD0, MUL1, ADD1; Load Convert Unit; Forwarding Muxes; Int to FP; FP to Int; 4 fp micro-op dispatch; 8 micro-op retire; 128 bit loads.]
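Since the reciprocal approximations are only specified to a relative error bound, code that compares their results across machines is safer checking against that bound than expecting bit-identical output. The sketch below illustrates this for RCPSS; the helper names are hypothetical, and on targets without SSE it falls back to an exact divide purely so that it still builds.

```c
#include <assert.h>
#if defined(__SSE__) || defined(_M_X64) || \
    (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RCP 1
#else
#define HAVE_SSE_RCP 0
#endif

/* Documented bound for RCPPS/RCPSS/RSQRTPS/RSQRTSS: 1.5 * 2^-12. */
static const double kMaxRelErr = 1.5 / 4096.0;

/* Approximate 1/x. Uses RCPSS when SSE is available; otherwise an exact
   divide (fallback is an assumption made for portability of the sketch). */
static float approx_rcp(float x) {
#if HAVE_SSE_RCP
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
#else
    return 1.0f / x;
#endif
}

/* Magnitude of the relative error against a double-precision reference. */
static double rcp_rel_error(float x) {
    assert(x != 0.0f);
    double exact = 1.0 / (double)x;
    double err = ((double)approx_rcp(x) - exact) / exact;
    return err < 0.0 ? -err : err;
}
```

A test suite would then assert `rcp_rel_error(x) <= kMaxRelErr` instead of comparing the approximation to a value captured on one particular processor.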
CLZERO INSTRUCTION
• AMD Vendor specific instruction
  • An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution Path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Load/store unit diagram; recoverable labels: Load Queue; Store Queue; L1/L2 TLB; Micro-tags; Page Walker; L0 Pick; L1 Pick; TLB0, DAT0, TLB1, DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; AGU0, AGU1; To Ex; To FP; WCB; To L2; MAB; Store Pipe Pick; STP; Store Commit.]
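The last bullet's advice amounts to leaving bulk zeroing to the C library. As a small sketch (the helper name and the `CACHE_LINE_BYTES` constant are hypothetical, with 64 taken from the cache line size mentioned in these slides), zeroing a run of cache lines needs nothing more than `memset`:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CACHE_LINE_BYTES 64u  /* line size quoted in the slides */

/* Zero line_count cache lines starting at dst using the C library,
   rather than issuing CLZERO per line. */
static void zero_cache_lines(void *dst, size_t line_count) {
    assert(dst != NULL);
    memset(dst, 0, line_count * CACHE_LINE_BYTES);
}
```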
CACHE LATENCY
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
[Chart: Cache Latency (less is better), core clock cycles]
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
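The Capacity column of the cache table follows from sets x ways x line size, for example 64 x 8 x 64 B = 32 KB for L1D. The sketch below checks that relationship and shows a textbook set-index calculation; the type and function names are hypothetical, and the simple modulo mapping is an assumption, since real hardware may index or hash addresses differently.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical description of one cache level from the table. */
typedef struct {
    uint32_t sets, ways, line_bytes;
} cache_geometry;

/* Capacity in bytes = sets * ways * line size. */
static uint64_t cache_capacity_bytes(cache_geometry g) {
    return (uint64_t)g.sets * g.ways * g.line_bytes;
}

/* Set index under a plain modulo mapping (an illustrative assumption). */
static uint32_t cache_set_index(cache_geometry g, uint64_t addr) {
    assert(g.sets != 0 && g.line_bytes != 0);
    return (uint32_t)((addr / g.line_bytes) % g.sets);
}
```

Under this mapping, two addresses 4 KB apart land in the same L1D set (64 sets x 64 B), so more than 8 hot lines at that stride would compete for the 8 ways.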
REFILL FROM LOCAL DRAM
• ~64ns near memory (see footnote RP2-22 on the 32 core processor slide)
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Diagram blocks: CCX x2, CCM x2, SDF Transport Layer, CS x2, UMC x2, DDR4 x2, CAKE, off-chip IFIS or IFOP.]
REFILL FROM OTHER LOCAL CCX
• Refill from other CCX cost may be similar to memory latency
[Diagram blocks: CCX x2, CCM x2, SDF Transport Layer, CS x2, UMC x2, DDR4 x2, CAKE, off-chip IFIS or IFOP.]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (see footnote RP2-22 on the 32 core processor slide)
[Diagram, two dies shown; blocks per die: CCX x2, CCM x2, SDF Transport Layer, CS x2, UMC x2, DDR4 x2, CAKE, off-chip IFIS or IFOP.]
AMD μPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support Multiple counters using the same event but different unit masks supported
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable, Wolfenstein 2 SAS Machine Finale 1080p High novsync
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
Excel Table
[Embedded chart data; the values are tabulated in the Excel tables earlier in this document. Recoverable chart titles and axes:]
• gxMT Off vs gxMT On: Average FPS, +36%
• Audacity LAME MP3 Encode x86 (lower is better): v3.99 non-SSE2 vs v3.100 SSE2, Elapsed Time (s), +107%
• DOTA 2 version 3359 gameplay 7.21B (higher is better): DX9, DX11, VK Beta, Average FPS, +9%, +12%
• Cache Latency (less is better): AMD Ryzen 7 1800X vs AMD Ryzen 7 2700X, L1D, L2U, L3U, core clock cycles
• Use Best Practices With Spinlocks (less is better): Before vs After binary, Elapsed Time (s), +1018%
• Avoid Too Many Non-Temporal Streams (less is better): Before vs After binary, Elapsed Time (s), +123%
• Avoid False Sharing (less is better): Before vs After binary, Elapsed Time (s), +679%
• LAME 3.100 WAV to MP3 Encode (less is better): Release x86, ReleaseSSE2 x86, Release x64, ReleaseSSE2 x64, seconds, +18%, +21%, +33%
• call memcpy benchmark (less is better): before vs after by size (1 to 160), Elapsed Time (s)
size
Elapsed Time (s)
call memcpy benchmark
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916002222912344275684802217887746190683202256260612613225602232334693108404302227336267046007302223417005944492402227843462618588302225329995439953022242250675648090221893426686683440222312396197396070221914689036011302231863514639080802224407448060690102228502734849364702225399274377102303333934053239018303328863188832842403334373712001688103363667446162412803338498309886923603327990047470301703338263319466416903335101108988007803334590883136565603328633342169613503333422177933667033334694809630605033344140971716163033311657046356435033383764264937810333662214760615866753326219934798872482990444299507728658220776103257374840270535695974641864958924776820208341692293762213262084310197751909795944808281243318510036708856760570415841738503611443949942882505146203946502837590844030817298671434610432651685768559579180668500448962317552856188456447311033965479890257068387704726295646556864532676916418055658738291164628196643751795997569267462996082704274691116169375305316914334994513846672741759432544484722033287701566817451935161276129172956708113064099736320004467274997548653819338486748032732555161131600253173252702336231562633181628363080591503856886383345138539043663052240964999449637764239524421496460299030885322865408218127200435662282187478088696617195168662875470005440041425473684150940511084566918356980365426369969451879906417078570274893504441798422023266326513586963596862315332496391531104354007356268154929546137903763535975401822095269873954629073200573703553250697506497785600128013447306256694571606985411566300868783472576001811848172079258659318506752056593553439672494686000192058811508760677712
026113948368469037123310014465816445356027464545948084244744703169258286039647620196002881023470681131038921574766493464012699482352719539061254883162357075675349339497049681729493817569961279952311078789537335512378553045777355179055648608259552403667273093468528977000546868143282341100588507939456773521350605410391360521146184154327316390315242096638678918747415410892838252374210527033071794142580914860372179431496620330337164369560821042029843689259553153432462969300972589264527497704605483645782253341689199463070276297856954683840750231089729572219121213768-013024170415581204-013038442823616014-013072528810275874-013028983466301447-013139717579912435-013074723083930939-013124118448967514-013023791018397635-013027537654282573-013026189115487552-013075811776531843-013038659808634656-013110980398503136-013035330377184295-013071944068452823011470746038168711013001918115359845013002912167835179012958857557869230129957881450424750129166369988617640130985290708048650129763451673847510129850129652770101302937815883937801295189516001431501289468995981952701328518920731875901293262046376468801303032393524739013037289959908827011835437438594409
size
B difference from A
D3D12Multithreading
TitleThrottle=10000
(less is better)
Draws102512345678910111213141516003621621621621622405089820359281438405774647887323944904534967555875992803448965977318212401858823529411763979807177289769715E-284454007530930575E-235439137134052334E-2-32165146423427826E-2-7225034514496087E-2-010040160642570273-020410580339518369-020190023752969122-02398190045248868Draws102501234567891011121314151600896251419916698211657703874049177222484324324324323273267080745341583089275993467610233814523184601928337762237762237752806435267291613728982874935132328300213106020245052966209081309397529672564034856088297145122918318762953684210526315530376242945444769
NumContexts
cpu time difference from NumContexts1
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling. Metrics shown (PTI = per thousand instructions):
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
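The PTI metrics are event counts normalized per thousand retired instructions, and IPC is retired instructions divided by CPU clocks. The following is a minimal sketch of that arithmetic, assuming raw counts are already available. The names `CounterSample`, `perThousandInstructions`, and `instructionsPerClock` are hypothetical and are not part of AMD uProf.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for raw counts attributed to one function.
struct CounterSample {
    uint64_t cpuClocks;            // unhalted CPU clocks
    uint64_t retiredInstructions;  // retired instructions
    uint64_t eventCount;           // any raw event count, e.g. WcbFull
};

// Events per thousand retired instructions (PTI).
inline double perThousandInstructions(const CounterSample& s) {
    assert(s.retiredInstructions != 0);
    return 1000.0 * static_cast<double>(s.eventCount) /
           static_cast<double>(s.retiredInstructions);
}

// Instructions per clock (IPC).
inline double instructionsPerClock(const CounterSample& s) {
    assert(s.cpuClocks != 0);
    return static_cast<double>(s.retiredInstructions) /
           static_cast<double>(s.cpuClocks);
}
```

Normalizing by retired instructions makes functions of different sizes comparable, which is why the thresholds later in this deck are quoted in PTI rather than as raw counts.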
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
Access methods labeled in the diagram: rdpmc, SMN, in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing.
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: the SMU may still reduce frequency if the application exceeds power, current, or thermal limits.
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors.

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled for the x64 platform shows higher performance.
• The ReleaseSSE2 configuration uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes.

LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) (less is better)
Configuration, Platform | Elapsed Time (s) | Change vs Release x86
Release, x86 | 52 | baseline
ReleaseSSE2, x86 | 44 | +18%
Release, x64 | 43 | +21%
ReleaseSSE2, x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

Example debugger output when an unsupported instruction executes:
    0:000> g
    (79c838) Illegal instruction - code c000001d (first chance)
    (79c838) Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
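Checking CPUID before selecting a code path can be sketched as below. This is an illustration under stated assumptions rather than the code used by any of the projects listed: the names `CpuidRegs`, `queryCpuid`, `hasSsse3`, `hasFma4`, and `chooseAbsPath` are hypothetical. The bit positions follow the AMD64 CPUID specification (SSSE3 is Fn0000_0001 ECX bit 9, FMA4 is Fn8000_0001 ECX bit 16). AVX-class features additionally require confirming operating system support for the extended register state, which this sketch does not cover.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Register values returned by one CPUID leaf.
struct CpuidRegs { uint32_t eax, ebx, ecx, edx; };

// Query a CPUID leaf; returns all zeros if the leaf (or CPUID) is unavailable.
inline CpuidRegs queryCpuid(uint32_t leaf) {
    CpuidRegs r = {0, 0, 0, 0};
    (void)leaf;
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    // EAX of leaf 0 / 0x80000000 reports the highest supported leaf in that range.
    __cpuid(regs, static_cast<int>(leaf & 0x80000000u));
    if (static_cast<uint32_t>(regs[0]) >= leaf) {
        __cpuid(regs, static_cast<int>(leaf));
        r.eax = static_cast<uint32_t>(regs[0]);
        r.ebx = static_cast<uint32_t>(regs[1]);
        r.ecx = static_cast<uint32_t>(regs[2]);
        r.edx = static_cast<uint32_t>(regs[3]);
    }
#elif defined(__x86_64__) || defined(__i386__)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(leaf, &a, &b, &c, &d)) {  // performs its own max-leaf check
        r.eax = a; r.ebx = b; r.ecx = c; r.edx = d;
    }
#endif
    return r;
}

// SSSE3: CPUID Fn0000_0001 ECX bit 9.
inline bool hasSsse3(const CpuidRegs& leaf1) { return ((leaf1.ecx >> 9) & 1u) != 0; }

// FMA4: CPUID Fn8000_0001 ECX bit 16.
inline bool hasFma4(const CpuidRegs& extLeaf1) { return ((extLeaf1.ecx >> 16) & 1u) != 0; }

// Hypothetical dispatch: only take the SSSE3 (pabsd) path when it is enumerated.
enum class AbsPath { Scalar, Ssse3 };
inline AbsPath chooseAbsPath() {
    return hasSsse3(queryCpuid(1)) ? AbsPath::Ssse3 : AbsPath::Scalar;
}
```

A caller would evaluate `chooseAbsPath()` once at startup and store a function pointer for the selected implementation, so the check stays out of hot loops.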
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
  • Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • “Number of cores per processor” & “Number of threads per processor” return incorrect values.
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index
    #if 0 // BAD
    int x = p;                          // signed narrowing
    int count = 0;
    // while (x != 0) {                 // never zero
    while (x > 0) {                     // false
        count += (x & 1);
        x >>= 1;                        // fill bits with sign
    }
    printf("popcnt = %i\n", count);     // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
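The same point can be shown without Windows types. The sketch below keeps the mask in an unsigned 64-bit integer and builds single-bit masks with a 64-bit constant; `countSetBits` and `maskForProcessor` are hypothetical helper names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Count set bits in a 64-bit affinity mask. The unsigned type guarantees
// that right shifts fill with zeros, so the loop always terminates.
inline int countSetBits(uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Build the mask bit for one logical processor index (0..63).
// 1ULL keeps the shift in 64 bits; a plain int 1 would be undefined
// behavior for index >= 31.
inline uint64_t maskForProcessor(unsigned index) {
    assert(index < 64);
    return 1ULL << index;
}
```

For example, `countSetBits(0xffffffffffffffffULL)` yields 64 and `maskForProcessor(32)` yields 0x0000000100000000, matching the GOOD branch on the slide.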
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];  /* bad assumption */
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // use the topology data in buffer
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or insufficient buffers.
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL

BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts worker threads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores - 1, Draws / 250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb 4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16); Y axis: cpu time difference from NumContexts 1 (-50 to 400); series: Draws 1025, Draws 10250.]
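The recommendation NumContexts = min(cores - 1, Draws / 250) can be expressed as a small helper. This is a sketch under assumptions: the function name `recommendedNumContexts` is hypothetical, the division is treated as integer division, and the lower bound of one context is an added safeguard that the slide does not state.

```cpp
#include <algorithm>
#include <cassert>

// NumContexts = min(cores - 1, draws / 250), clamped to at least 1.
// The clamp is an assumption so that small scenes still get one context.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    assert(physicalCores >= 1);
    const unsigned byCores = (physicalCores > 1) ? physicalCores - 1 : 1;
    const unsigned byDraws = draws / 250;
    return std::max(1u, std::min(byCores, byDraws));
}
```

With the 8-core test system above, 1025 draws gives 4 contexts and 10250 draws gives 7, so the limit shifts from draw count to core count as the scene grows.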
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                    pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
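For code that does not use the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant, not the code from the slide: `SpinLock`, `tryLock`, and `cpuRelax` are hypothetical names, and the unlock uses a release store instead of an exchange. On x86, compilers typically emit `xchg` for the exchange, but the generated instructions may vary.

```cpp
#include <atomic>
#include <cassert>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpuRelax() { _mm_pause(); }   // pause instruction while spinning
#else
inline void cpuRelax() {}                 // no-op on other architectures
#endif

// One lock per 64-byte cache line, so neighbors do not share the line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};       // 0 = free, 1 = taken

    // Test with a plain load first; only attempt the exchange if it looks free.
    bool tryLock() {
        return state.load(std::memory_order_relaxed) == 0 &&
               state.exchange(1, std::memory_order_acquire) == 0;
    }

    void lock() {
        while (!tryLock()) {
            cpuRelax();
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};
```

The initial load keeps waiting threads reading a shared copy of the cache line instead of repeatedly requesting it for write, which is the contention the BAD variant creates.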
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Use Best Practices With Spinlocks, Elapsed Time (s) (less is better)
Binary | Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

[Slide image: SOURCE BEFORE]
[Slide image: SOURCE AFTER]
DISASSEMBLY SNIPPET
BEFORE (BAD)
    00000001400010C1:
        xor eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp eax,ecx
        je 00000001400010C1

AFTER (GOOD)
    00000001400010C0:
        cmp dword ptr [?gLock@@3IA],1
        je 00000001400010D9
        mov eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp eax,1
        jne 00000001400010DD
    00000001400010D9:
        pause
        jmp 00000001400010C0
    00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix (_InterlockedExchange, _InterlockedExchangePointer).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently, “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out the lines highlighted in yellow on the slide
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder.

    ; memcpy.asm line 456
    XmmCopySmall:
        bt __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc memcpy_repmovs
        movups xmm0, [rcx + rdx]     ; load deferred bytes
        add rcx, 16
        sub r8, 16

    ; memset.asm line 93
        ; Check if strings should be used
        cmp r8, 128                  ; is this a small set, size <= 128?
        ja XmmSet                    ; if large set, use XMM set
        bt __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc XmmSetSmall              ; otherwise, use a 16-byte block set
        jmp memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160); Y axis: Elapsed Time (s) (0 to 35); series: before, after.]
CODE SAMPLE
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    #include <numeric>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);   // size is unknown at compile time
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide used i <= sizeof(a), which writes one byte past the array.
        for (size_t i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb 00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Avoid Too Many Non-Temporal Streams, Elapsed Time (s) (less is better)
Binary | Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

[Slide image: SOURCE BEFORE]
[Slide image: SOURCE AFTER]
DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0

AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
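The effect of aligning and padding per-thread data can be checked at compile time. The sketch below assumes a 64-byte cache line; `kCacheLineBytes`, `PackedCounter`, `PaddedCounter`, and `cacheLinesSpanned` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;  // assumed cache line size

// Adjacent array elements share a cache line: prone to false sharing.
struct PackedCounter { unsigned long sum; };

// Each element is aligned and padded to its own cache line.
struct alignas(kCacheLineBytes) PaddedCounter { unsigned long sum; };

static_assert(sizeof(PaddedCounter) == kCacheLineBytes,
              "padded element should fill exactly one cache line");
static_assert(alignof(PaddedCounter) == kCacheLineBytes,
              "padded element should start on a cache line boundary");

// Number of distinct cache lines covered by `count` consecutive elements.
template <typename T>
std::size_t cacheLinesSpanned(const T* first, std::size_t count) {
    if (count == 0) return 0;
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(first);
    const std::uintptr_t last = begin + count * sizeof(T) - 1;
    return static_cast<std::size_t>(last / kCacheLineBytes - begin / kCacheLineBytes) + 1;
}
```

If four threads each update one element, four `PaddedCounter` elements span four cache lines, while four `PackedCounter` elements fit within one or two lines and every update invalidates the line in the other cores' caches.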
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Avoid False Sharing, Time (s)
Binary | Time (s)
Before | 365
After | 47 (+679%)
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0
// 4 bytes: adjacent elements share a cache line (false sharing)
struct ThreadData { unsigned long sum; };
#else
// 64 bytes: one element per cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

// Each thread accumulates into its own ThreadData element.
DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    // One worker per hardware thread, each given &a[i].
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
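The layout difference between the two `ThreadData` variants can be checked without timing anything. The following is a minimal sketch, not part of the original sample: `PackedCounter`, `PaddedCounter`, `cache_line_index`, and `on_same_cache_line` are hypothetical names, and the 64-byte line size is taken from the cache slide of this deck.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size (64 bytes, as stated for "Zen" in this deck).
constexpr std::size_t kCacheLine = 64;

// Packed layout: neighbouring array elements are only 4 bytes apart.
struct PackedCounter {
    std::uint32_t sum;
};

// Padded layout: alignas raises both alignment and sizeof to 64 bytes.
struct alignas(kCacheLine) PaddedCounter {
    std::uint32_t sum;
};

static_assert(sizeof(PackedCounter) == 4, "packed elements are 4 bytes apart");
static_assert(sizeof(PaddedCounter) == kCacheLine, "padded elements are 64 bytes apart");
static_assert(alignof(PaddedCounter) == kCacheLine, "each padded element starts a line");

// Hypothetical helper: which 64-byte line does this object start in?
template <typename T>
std::uintptr_t cache_line_index(const T* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// Hypothetical helper: true if both objects start in the same cache line.
template <typename T>
bool on_same_cache_line(const T* a, const T* b) {
    return cache_line_index(a) == cache_line_index(b);
}
```

With a line-aligned array, `on_same_cache_line(&packed[0], &packed[1])` holds for the packed layout and fails for the padded one. For heap arrays, note that `new` is only required to honor over-aligned types such as `alignas(64)` from C++17 onward.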
PROFILING BEFORE
Many refills from CCX
[uProf profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[uProf profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
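The "PTI" columns normalize raw event counts so that runs of different length can be compared. A minimal sketch of that normalization follows, assuming PTI means "per thousand retired instructions"; the function names are hypothetical and the counts in the usage note are made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions.
// Assumption: PTI = 1000 * events / retired instructions.
double per_thousand_instructions(std::uint64_t events,
                                 std::uint64_t retired_instructions) {
    if (retired_instructions == 0) return 0.0;  // avoid division by zero
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}

// Instructions per cycle from the "Ret inst" and "CPU clocks" counters.
double instructions_per_cycle(std::uint64_t retired_instructions,
                              std::uint64_t cpu_clocks) {
    if (cpu_clocks == 0) return 0.0;
    return static_cast<double>(retired_instructions) /
           static_cast<double>(cpu_clocks);
}
```

For example, 50 refills over 10,000 retired instructions is 5 PTI, and 3,000 instructions over 2,000 clocks is an IPC of 1.5.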
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
  mov   qword ptr [rsp+28h],rsi
  lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9,rbp
  mov   dword ptr [rsp+20h],esi
  xor   edx,edx
  xor   ecx,ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r15+rdi*8],rax
  add   rbp,40h          ; next ThreadData is 64 bytes away
  inc   rdi
  cmp   rdi,r14
  jb    0000000140001180
BEFORE (BAD)
0000000140001180:
  mov   qword ptr [rsp+28h],rsi
  lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9,rbp
  mov   dword ptr [rsp+20h],esi
  xor   edx,edx
  xor   ecx,ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r12+rdi*8],rax
  add   rbp,4            ; next ThreadData is 4 bytes away
  inc   rdi
  cmp   rdi,r14
  jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
 | 2017 The International | | | 2015 The Frankfurt Major | |
run | DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan
1 | 48.6 | 54.0 | 53.2 | 52.1 | 58.9 | 59.0
2 | 51.8 | 54.3 | 54.7 | 53.8 | 58.8 | 60.2
3 | 51.7 | 54.5 | 54.9 | 53.9 | 58.8 | 60.0
4 | 51.8 | 54.4 | 55.1 | 53.9 | 58.9 | 60.2
average last 3 | 51.8 | 54.4 | 54.9 | 53.9 | 58.9 | 60.1
diff vs DX9 (%) | | +5 | +6 | | +9 | +12
NumContext | Draws1025 | Draws10250 | Draws1025 diff (%) | Draws10250 diff (%)
1 | 0.2016 | 1.5024 | 0 | 0
2 | 0.148 | 0.7923 | 36 | 90
3 | 0.1336 | 0.5653 | 51 | 166
4 | 0.1278 | 0.4625 | 58 | 225
5 | 0.1387 | 0.4025 | 45 | 273
6 | 0.1499 | 0.3674 | 34 | 309
7 | 0.17 | 0.3429 | 19 | 338
8 | 0.1867 | 0.3432 | 8 | 338
9 | 0.1859 | 0.3947 | 8 | 281
10 | 0.1947 | 0.3854 | 4 | 290
11 | 0.2083 | 0.3754 | -3 | 300
12 | 0.2173 | 0.3788 | -7 | 297
13 | 0.2241 | 0.3787 | -10 | 297
14 | 0.2533 | 0.3783 | -20 | 297
15 | 0.2526 | 0.3800 | -20 | 295
16 | 0.2652 | 0.3721 | -24 | 304
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64
Average | 52 | 44 | 43 | 39
1 | 52 | 44 | 43 | 39
2 | 52 | 44 | 43 | 39
3 | 52 | 44 | 43 | 39
4 | 52 | 44 | 43 | 39
diff (%) | | +18 | +21 | +33
Input: 1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021)
before (ms) | after (ms)
387827.0 | 46866.7
357936.9 | 46861.3
361779.7 | 46860.6
358574.6 | 46874.0
405331.1 | 46879.8
362097.3 | 46866.6
351227.9 | 46883.5
361547.6 | 46878.6
350793.7 | 46874.0
356234.1 | 46873.0
binary | seconds | difference
Before | 365 |
After | 47 | +679%
before (ms) | after (ms)
100761.0284 | 47658.2356
102424.7928 | 44804.5378
105214.5299 | 46665.9369
105673.5958 | 46490.7149
102436.8829 | 47405.3478
106663.3156 | 46167.4963
105820.3029 | 46251.4829
104111.8097 | 49289.895
105539.2649 | 47589.8095
105117.2582 | 46176.0799
binary | seconds | difference
Before | 104 |
After | 47 | +123%
before (ms) | after (ms)
360865.1906 | 26743.1402
224205.3937 | 27370.7055
304288.6884 | 27174.6015
260937.2553 | 26995.3529
323743.2947 | 26973.0537
299180.3304 | 27254.37
366555.8054 | 27383.2348
307769.6925 | 27258.0411
237428.8763 | 27185.4875
350497.6333 | 27232.9958
binary | seconds | difference
Before | 304 |
After | 27 | +1018%
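The "difference" labels in the three before/after tables above are consistent with a simple ratio of mean times. The sketch below shows that arithmetic; the formula is inferred from the numbers rather than stated in the deck, the function names are hypothetical, and the inputs in the usage note are approximate means of the listed runs.

```cpp
#include <cassert>
#include <cmath>

// Improvement for a "lower is better" metric such as elapsed time.
// Assumption: percent = (before / after - 1) * 100.
double improvement_percent_time(double before_s, double after_s) {
    return (before_s / after_s - 1.0) * 100.0;
}

// Improvement for a "higher is better" metric such as average FPS.
// Assumption: percent = (after / before - 1) * 100.
double improvement_percent_rate(double before, double after) {
    return (after / before - 1.0) * 100.0;
}
```

With mean times of roughly 365.335 s and 46.872 s, `improvement_percent_time` gives about 679%; with 104.376 s and 46.850 s, about 123%; with 303.547 s and 27.157 s, about 1018%.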
Cache latency (core clock cycles) | L1D | L2U | L3U
AMD Ryzen 7 1800X | 4 | 17 | 40
AMD Ryzen 7 2700X | 4 | 12 | 35
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
DISCLAIMER
• World of Warcraft
  • https://community.amd.com/community/gaming/blog/2019/02/08/ryzen-processors-are-now-optimized-for-both-alliance-and-horde-in-world-of-warcraft
  • Testing done by AMD performance labs, January 4, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200, GeForce GTX 1080 (driver 398.82), MSI B450 GAMING PLUS Socket AM4 motherboard, Samsung 850 SSD, Windows 10 x64 Pro (RS4).
• DOTA 2
  • Testing done by Kenneth Mitchell, February 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, Radeon™ R7 SSD, 1920x1080 resolution, Best Looking, DOTA 2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
• Audacity Lame MP3 Encode
  • https://www.pcper.com/reviews/Processors/Ryzen-7-2700X-and-Ryzen-5-2600X-Review-Zen-Matures/Media-Encoding-and-Rendering
  • AMD Ryzen™ 7 2700X Processor, 16GB Corsair Vengeance DDR4-3200 at 2933, NVIDIA GeForce GTX 1080Ti 11GB (driver 390.77), ASUS Crosshair VII Hero motherboard, Corsair Neutron XTi 480 SSD, Windows 10 Pro x64 RS3 fully updated as of 412017.
“ZEN” FAMILY PROCESSORS
DATA FLOW
“PINNACLE RIDGE” 8 CORE PROCESSOR
[Block diagram: data flow between caches, Data Fabric, memory, and IO]
• Blocks: 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller
• Link widths shown: 32B fetch, 2*16B load, 1*16B store, 32B/cycle, 16B/cycle
• Clock domains shown: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 4.3 GHz, 3.7 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 615 MHz
“RAVEN RIDGE” 4 CORE PROCESSOR
[Block diagram: data flow between caches, Data Fabric, memory, IO, graphics, and media]
• Blocks: 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 4M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, Media
• Link widths shown: 32B fetch, 2*16B load, 1*16B store, 32B/cycle, 16B/cycle
• Clock domains shown: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 3.9 GHz, 3.6 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 496 MHz
• sclk: 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Block diagram: data flow between caches, Data Fabric, memory, IO, and GMI]
• Blocks: 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GMI
• Link widths shown: 32B fetch, 2*16B load, 1*16B store, 32B/cycle, 16B/cycle, GMI 16B/cycle off die
• Clock domains shown: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 4.4 GHz, 3.5 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR Channels
• ~64ns near memory
• ~105ns far memory
• NUMA or UMA
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Block diagram: Die 0 and Die 1 linked by Infinity Fabric (∞); four CCXs; DDR4 Channels A to D; IO and DDR on each die; four sets of 16 PCIe® Gen 3 lanes]
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR Channels
• ~64ns near memory
• ~105ns far memory
• NUMA only
• Windows 10 2019H1 Insider Preview has improved NUMA support
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Block diagram: Die 0 to Die 3 linked by Infinity Fabric (∞); eight CCXs; DDR4 Channels A to D; IO and DDR; four sets of 16 PCIe® Gen 3 lanes]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC than previous generation AMD processors
• Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”) / 8GB DDR3-2133 (“Excavator”) / 8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”) / 8GB DDR3-2133 (“Excavator”) / 8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs 19.2 | +64%). Cinebench R15 1t scores: “Zen” vs. “Piledriver” (139 vs 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.

YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450
2011 | 14h | 00h-0Fh | “Ontario” | E-450
2011 | 12h | | “Llano” | A8-3870
2009 | 10h | | “Greyhound” | Phenom II X4 955

Instruction support (1 = supported, 0 = not supported):
Instruction | Ryzen™ 7 2700X | A12-9800 | A10-7890K | FX-8370 | FX-8150 | A6-1450 | E-450 | A8-3870 | Phenom II X4 955
ADX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
CLFLUSHOPT | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
RDSEED | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SHA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SMAP | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
XGETBV | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
XSAVEC | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
XSAVES | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
AVX2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
BMI2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
MOVBE | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0
RDRND | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SMEP | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FSGSBASE | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0
XSAVEOPT | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0
BMI | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0
FMA | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
F16C | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0
AES | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
AVX | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
OSXSAVE | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
PCLMULQDQ | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
SSE4.1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
SSE4.2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
XSAVE | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
SSSE3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0
CLZERO | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FMA4 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
TBM | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
XOP | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
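Because FMA4, TBM, and XOP are no longer reported on Family 17h, code paths are safer when selected from feature flags than from the CPU vendor or family. The following is a minimal sketch, assuming a GCC or Clang toolchain on x86 where `__builtin_cpu_supports` is available; `CpuFeatures`, `detect_features`, and `pick_kernel` are hypothetical names, and an MSVC build would need a different query mechanism such as the `__cpuid` intrinsic.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical feature summary used to choose a code path.
struct CpuFeatures {
    bool avx;
    bool avx2;
    bool fma;
};

// Query the running CPU. On other compilers or architectures this
// conservatively reports no optional features.
CpuFeatures detect_features() {
    CpuFeatures f{false, false, false};
#if (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    f.avx  = __builtin_cpu_supports("avx")  != 0;
    f.avx2 = __builtin_cpu_supports("avx2") != 0;
    f.fma  = __builtin_cpu_supports("fma")  != 0;
#endif
    return f;
}

// Pick the widest path whose required features are all present.
const char* pick_kernel(const CpuFeatures& f) {
    if (f.avx2 && f.fma) return "avx2+fma";
    if (f.avx) return "avx";
    return "baseline";
}
```

Calling `pick_kernel(detect_features())` once at startup and caching the result keeps the check out of hot loops.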
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
• Avoid mixing legacy SIMD and AVX instructions
• Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
• Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS, and RSQRTSS, which define relative error as
• |Relative Error| <= 1.5 * 2^-12
[Block diagram: floating point unit. Decode/Rename; 192 Entry Retire Queue; 4 fp micro-op dispatch; 8 micro-op retire; 36 Entry Scheduler; 160 entry Physical Register File; Forwarding Muxes; MUL0, ADD0, MUL1, ADD1; Load Convert Unit; 128 bit loads; Int to FP; Fp to Int]
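Since the reciprocal approximations are only specified to a relative error bound, code that compares results across machines may prefer a tolerance check over exact equality. The sketch below illustrates this for RCPSS using the `_mm_rcp_ss` intrinsic, with a plain division fallback where SSE is not available; the helper names are hypothetical, and the Newton-Raphson refinement shown tightens the error but is not claimed to make results bit-identical across vendors.

```cpp
#include <cassert>
#include <cmath>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_RCP 1
#else
#define HAVE_SSE_RCP 0
#endif

// Approximate 1/x. With SSE this maps to RCPSS, whose result is only
// specified to within |relative error| <= 1.5 * 2^-12.
float approx_rcp(float x) {
#if HAVE_SSE_RCP
    return _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
#else
    return 1.0f / x;  // fallback for non-SSE targets
#endif
}

// One Newton-Raphson step: y' = y * (2 - x * y).
float refined_rcp(float x) {
    const float y = approx_rcp(x);
    return y * (2.0f - x * y);
}

// Tolerance comparison instead of exact equality.
bool within_relative_error(float approx, float exact, float bound) {
    return std::fabs((approx - exact) / exact) <= bound;
}
```

A caller would compare `approx_rcp(x)` against `1.0f / x` with a bound of `1.5f * std::ldexp(1.0f, -12)`, and can use a much smaller bound after refinement.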
CLZERO INSTRUCTION
• AMD vendor specific instruction
• An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
• Example: kill a corrupt user process but keep the system running
• Execution Path: Store Queue > Store Commit > Write Combining Buffer > L2 > Data Fabric > Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Block diagram: load/store unit. AGU0, AGU1; Load Queue; Store Queue; L1/L2 TLB; Micro-tags; Page Walker; L0 Pick; L1 Pick; TLB0, DAT0, TLB1, DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; To Ex; To FP; Store Pipe Pick; STP; Store Commit; WCB; MAB; To L2]
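The last bullet can be illustrated with a short sketch that clears a line-aligned buffer through `std::memset` and leaves the choice of store instructions to the C runtime. `Line` and `zero_lines` are hypothetical names, and the 64-byte line size is the value given in this deck.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Assumed cache line size (64 bytes, per this deck).
constexpr std::size_t kCacheLine = 64;

// One cache line worth of bytes, aligned to a line boundary.
struct alignas(kCacheLine) Line {
    unsigned char bytes[kCacheLine];
};

// Zero `count` cache lines with memset rather than a CLZERO intrinsic.
void zero_lines(Line* lines, std::size_t count) {
    std::memset(lines, 0, count * sizeof(Line));
}
```

The same call works for any trivially copyable buffer; the `Line` type only makes the cache-line granularity explicit.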
CACHE LATENCY
• Cache line size is 64 Bytes
• 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
• Lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
• And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions

Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks

[Chart] Cache Latency in core clock cycles (less is better)
 | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
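The capacity column in the table above follows from sets × ways × line size, which can be checked at compile time. The sketch below does that and adds two small helpers; `CacheLevel`, `set_index`, and `latency_ns` are hypothetical names, and the set index uses a plain modulo mapping as a simplifying assumption, since real hardware may hash address bits.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>

// Geometry of one cache level, taken from the table above.
struct CacheLevel {
    std::size_t sets;
    std::size_t ways;
    std::size_t line_bytes;
};

constexpr std::size_t capacity_bytes(CacheLevel c) {
    return c.sets * c.ways * c.line_bytes;
}

constexpr CacheLevel kL1I{256, 4, 64};
constexpr CacheLevel kL1D{64, 8, 64};
constexpr CacheLevel kL2{1024, 8, 64};
constexpr CacheLevel kL3{8192, 16, 64};

static_assert(capacity_bytes(kL1I) == 64u * 1024u, "L1I is 64 KB");
static_assert(capacity_bytes(kL1D) == 32u * 1024u, "L1D is 32 KB");
static_assert(capacity_bytes(kL2) == 512u * 1024u, "L2 is 512 KB");
static_assert(capacity_bytes(kL3) == 8u * 1024u * 1024u, "L3 is 8 MB");

// Set selected by an address, assuming simple modulo indexing.
constexpr std::size_t set_index(std::uint64_t address, CacheLevel c) {
    return static_cast<std::size_t>((address / c.line_bytes) % c.sets);
}

// Convert a latency in core clocks to nanoseconds at a given core clock.
constexpr double latency_ns(unsigned clocks, double core_ghz) {
    return clocks / core_ghz;
}
```

Under this model, L1D addresses that are 4096 bytes apart map to the same set, and the 35-clock L3 latency at a 3.7 GHz core clock is roughly 9.5 ns.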
REFILL FROM LOCAL DRAM
• ~64ns near memory (see RP2-22 above)
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Block diagram: two CCXs, each with a CCM, attached to the SDF Transport Layer; two CS/UMC pairs driving DDR4; CAKE to IFIS or IFOP off-chip]
REFILL FROM OTHER LOCAL CCX
• Refill from other CCX cost may be similar to memory latency
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Block diagram: two CCXs, each with a CCM, attached to the SDF Transport Layer; two CS/UMC pairs driving DDR4; CAKE to IFIS or IFOP off-chip]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (see RP2-22 above)
[Block diagram: two dies, each with two CCXs and CCMs on an SDF Transport Layer, CS/UMC pairs driving DDR4, and a CAKE; the dies are linked off-chip through IFIS or IFOP]
AMD UPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201926
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
Threads   Time (s)
0         23
1         29
2         6
3         3
4         4
5         4
6         4
7         3
8         2
9         2
10        1
11        1
12        1
13        1
14        1
15        5
16        11
Excel Table (chart source data; raw per-point series omitted)
Cache Latency (less is better), core clock cycles
                     L1D   L2U   L3U
AMD Ryzen 7 1800X    4     17    40
AMD Ryzen 7 2700X    4     12    35
Use Best Practices With Spinlocks (less is better), Elapsed Time (s) by binary: Before 304, After 27 (+1018%)
Avoid Too Many Non-Temporal Streams (less is better), Elapsed Time (s) by binary: Before 104, After 47 (+123%)
Avoid False Sharing (less is better), Elapsed Time (s) by binary: Before 365, After 47 (+679%)
LAME 3.100 WAV to MP3 Encode (less is better), Elapsed Time (s) by Configuration/Platform: Release x86 52, ReleaseSSE2 x86 44 (+18%), Release x64 43 (+21%), ReleaseSSE2 x64 39 (+33%)
call memcpy benchmark (less is better), Elapsed Time (s) by size, series before and after; B difference from A by size
D3D12Multithreading TitleThrottle=10000, cpu time difference from NumContexts 1 by NumContexts, series Draws 1025 and Draws 10250
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
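Several of the metrics above are ratios of raw counts: IPC divides retired instructions by CPU clocks, and a PTI value scales an event count to "per thousand instructions". A minimal sketch of that arithmetic follows, assuming C++11 or later. The struct and function names are hypothetical and are not part of AMD uProf.

```cpp
#include <cstdint>

// Raw totals for one function, as they might be read from a profile.
// Field names are illustrative only (assumption, not a uProf API).
struct CounterTotals {
    std::uint64_t cycles;        // CPU clocks
    std::uint64_t retired_inst;  // retired instructions
    std::uint64_t event_count;   // any other event, e.g. a stall counter
};

// Instructions per cycle; returns 0 when no cycles were recorded.
constexpr double ipc(const CounterTotals& t) {
    return t.cycles != 0
               ? static_cast<double>(t.retired_inst) / static_cast<double>(t.cycles)
               : 0.0;
}

// Event occurrences per thousand retired instructions (PTI).
constexpr double pti(const CounterTotals& t) {
    return t.retired_inst != 0
               ? 1000.0 * static_cast<double>(t.event_count) /
                     static_cast<double>(t.retired_inst)
               : 0.0;
}
```

For example, 4,000,000 events over 1,000,000 retired instructions is 4000 PTI, the "~4K" order of magnitude this deck later flags as bad for ALUTokenStall.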
PERFORMANCE COUNTER DOMAINS
FP: floating point
LS: load/store
IC/BP: instruction cache and branch prediction
EX (SC): integer ALU & AGU execution and scheduling
L2
DE: instruction decode, dispatch, microcode sequencer & micro-op cache
L3
DF: Data Fabric
UMC: Unified Memory Controller (NDA only)
IOHC: IO Hub Controller (NDA only)
rdpmc   SMN   in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
• Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: SMU may still reduce frequency if application exceeds power, current, thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
• Binary compiled after applying Platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORM X64
[Chart: LAME 3.100, WAV to MP3 Encode (less is better). Elapsed Time (s) by Configuration / Platform]
Configuration / Platform   seconds   vs Release x86
Release x86                52
ReleaseSSE2 x86            44        +18%
Release x64                43        +21%
ReleaseSSE2 x64            39        +33%
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
• ISPC https://github.com/ispc/ispc
  • x86-x64 issue fixed November 21, 2018
  • CPU_Pentium4 issue fixed July 12, 2018
• Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
  • pabsd issue fixed November 2nd, 2018
• Bullet3 https://github.com/bulletphysics/bullet3
  • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW
0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
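The pabsd crash above is the kind of fault a startup feature check avoids. The following is a minimal sketch rather than the fix used by any of the projects listed: `cpuHasSsse3`, `AbsPath` and `selectAbsPath` are hypothetical names. It relies on the MSVC `__cpuid` intrinsic or the GCC/Clang `__builtin_cpu_supports` builtin, and conservatively reports no support on any other compiler or architecture.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#endif

// True when the running CPU enumerates SSSE3 (CPUID function 1, ECX bit 9).
bool cpuHasSsse3() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};  // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return (regs[2] & (1 << 9)) != 0;
#elif (defined(__GNUC__) || defined(__clang__)) && \
    (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();  // harmless if the runtime already initialized it
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;  // unknown target: assume the feature is absent
#endif
}

// Which implementation of an "absolute value" kernel to run.
enum class AbsPath { Scalar, Ssse3 };

// Pick the SSSE3 path (for example one built around pabsd) only when the
// feature is enumerated; otherwise fall back to plain scalar code.
constexpr AbsPath selectAbsPath(bool hasSsse3) {
    return hasSsse3 ? AbsPath::Ssse3 : AbsPath::Scalar;
}
```

A typical use is to evaluate `selectAbsPath(cpuHasSsse3())` once at startup and cache the result, so the check stays out of hot loops.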
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        } else {
            count = cores;
        }
    }
    return count;
}
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
• But games often suffer from SMT contention on the main thread
• One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
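The slide's function depends on helpers that are not shown. As a hedged illustration of the decision logic only, the sketch below decodes the family from a CPUID function 1 EAX value and applies the same rule. The names `cpuidFamily` and `defaultThreadCount` are hypothetical, and the vendor check is reduced to a boolean the caller supplies.

```cpp
#include <cstdint>

// Decode the display family from CPUID function 1 EAX:
// base family is bits 11:8; when it is 0xF the extended family
// (bits 27:20) is added to it.
constexpr std::uint32_t cpuidFamily(std::uint32_t eax) {
    return ((eax >> 8) & 0xFu) == 0xFu
               ? 0xFu + ((eax >> 20) & 0xFFu)
               : ((eax >> 8) & 0xFu);
}

// Same rule as the slide: on AMD parts other than family 15h
// ("Bulldozer"), default to the physical core count; everywhere else
// use the logical processor count.
constexpr std::uint32_t defaultThreadCount(bool isAmd, std::uint32_t family,
                                           std::uint32_t cores,
                                           std::uint32_t logical) {
    return (isAmd && family != 0x15u) ? cores : logical;
}
```

With an example family 17h signature such as 0x00800F82, `cpuidFamily` yields 0x17, so an 8-core, 16-thread part defaults to 8 threads. The slide's advice to profile the application and settle on the ideal count still applies.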
Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
• Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64;                     // logical processors
DWORD_PTR p = 0xffffffffffffffff;   // process mask
int b = 32;                         // bit index

#if 0 // BAD
int x = p;                          // signed narrowing
int count = 0;
// while (x != 0) {                 // never zero
while (x > 0) {                     // false
    count += (x & 1);
    x >>= 1;                        // fill bits with sign
}
printf("popcnt = %i\n", count);     // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
• Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
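The same idea can be written without Windows types. The sketch below is a hedged, portable variant and not the slide's code: it keeps the mask in an unsigned 64-bit integer end to end, so shifts never involve a sign bit. The helper names are hypothetical, and C++14 or later is assumed for the loop inside a constexpr function.

```cpp
#include <cstdint>

// Count set bits in a 64-bit affinity mask. The parameter is unsigned,
// so right shifts fill with zeros and the loop always terminates.
constexpr int countSetBits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Build the mask for one logical processor index (0..63). Shifting a
// 64-bit unsigned one avoids the undefined behavior of (1 << 32) on int.
constexpr std::uint64_t bitMask(unsigned index) {
    return std::uint64_t{1} << index;
}
```

A full 64-processor mask then counts as 64, and index 32 yields 0x0000000100000000, matching the "GOOD" branch above.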
// char buffer[0x1000];   // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
            // ...
        }
    }
    free(buffer);
}
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
• Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
• Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
• See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
BUILD COMMAND LISTS IN PARALLEL
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). x-axis: NumContexts (0 to 16); y-axis: cpu time difference from NumContexts 1 (-50 to 400); series: Draws 1025, Draws 10250]
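The recommendation NumContexts = min(cores-1, Draws/250) is simple arithmetic, sketched below. The function name is hypothetical, "cores" is taken to mean physical cores, and the lower clamp to 1 is an added assumption so that small draw counts or single-core systems still get one context.

```cpp
// Rule of thumb from the slide: NumContexts = min(cores - 1, Draws / 250).
// The clamp to at least 1 is an assumption added for this sketch.
constexpr unsigned recommendedNumContexts(unsigned cores, unsigned draws) {
    return (cores < 2u || draws < 250u)
               ? 1u
               : ((cores - 1u) < (draws / 250u) ? (cores - 1u)
                                                : (draws / 250u));
}
```

On an 8-core processor this gives 4 contexts for the 1025-draw case and 7 for the 10250-draw case shown in the chart.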
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING
Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

#if 0 // BAD
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                   pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            // lock cmpxchg, cmp
        }
    }
#else // GOOD
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}

alignas(64) MyLock::LOCK gLock;
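For readers not on MSVC, the same test and test-and-set pattern can be sketched with `std::atomic`. This is a portable variant under stated assumptions, not the slide's code: `SpinLock`, `SPIN_PAUSE` and `runCounterDemo` are hypothetical names, and the instructions actually generated depend on the compiler and flags.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()  // fallback on non-x86 targets
#endif

// Test and test-and-set spin lock, aligned to its own 64-byte line.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: wait on a plain load, pausing between polls.
            while (state_.load(std::memory_order_relaxed) == kTaken) {
                SPIN_PAUSE();
            }
            // Test-and-set: exchange only once the lock looked free.
            if (state_.exchange(kTaken, std::memory_order_acquire) == kFree) {
                return;
            }
        }
    }

    void unlock() {
        assert(state_.load(std::memory_order_relaxed) == kTaken);
        state_.store(kFree, std::memory_order_release);
    }

private:
    enum : unsigned { kFree = 0, kTaken = 1 };
    std::atomic<unsigned> state_{kFree};
};

// Small demo: several threads increment one counter under the lock.
long runCounterDemo(int numThreads, int itersPerThread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < itersPerThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return counter;
}
```

Calling `runCounterDemo(4, 20000)` should return 80000 when the lock provides mutual exclusion. Unlike the slide's Unlock, this sketch releases with a plain store rather than an exchange.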
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) by binary: Before 304, After 27 (+1018%)]
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm>   // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp  dword ptr [?gLock@@3IA], 1
    je   00000001400010D9
    mov  eax, 1
    xchg eax, dword ptr [?gLock@@3IA]
    cmp  eax, 1
    jne  00000001400010DD
00000001400010D9:
    pause
    jmp  00000001400010C0
00000001400010DD:
BEFORE (BAD)
00000001400010C1:
    xor  eax, eax
    lock cmpxchg dword ptr [?gLock@@3IA], ecx
    cmp  eax, ecx
    je   00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg
Preferred: instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
PREFACE
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder
WORKAROUND
memcpy.asm line 456:
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG     ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]           ; load deferred bytes
    add     rcx, 16
    sub     r8, 16
memset.asm line 93:
    ; Check if strings should be used
    cmp     r8, 128                     ; is this a small set, size <= 128?
    ja      XmmSet                      ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG     ; check if string set should be used
    jnc     XmmSetSmall                 ; otherwise, use a 16-byte block set
    jmp     memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). x-axis: size (0 to 160); y-axis: Elapsed Time (s) (0 to 35); series: before, after]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>    // printf
#include <cstdlib>   // atoi, atoll, rand, srand
#include <cstring>   // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // Use < rather than <= so the loop stays inside the array bounds.
    for (int i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
// calls memcpy within executable with workaround
memcpy:
    mov  r11, rcx
    mov  r10, rdx
    cmp  r8, 10h
    jbe  00000001400012F0
    cmp  r8, 20h
    jbe  00000001400012D0
    sub  rdx, rcx
    jae  00000001400012A4
    lea  rax, [r8+r10]
    cmp  rcx, rax
    jb   00000001400015D0
    cmp  r8, 80h
    jbe  0000000140001510
    …
BEFORE (WITHOUT WORKAROUND)
// calls vcruntime140.dll!memcpy
memcpy:
    jmp  qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
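When two output arrays both benefit from streaming stores, one way to keep a single Write Combining stream per hardware thread is to split the work into one pass per array. The following is a minimal sketch of that idea, not the workaround used in the talk (which replaces one streaming store with a regular store). It assumes an x86-64 compiler with SSE intrinsics; `scale_stream` and `scale_two_arrays` are hypothetical names chosen for illustration.

```cpp
#include <immintrin.h>  // SSE intrinsics; MSVC code may include <intrin.h> instead
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: dst[i] = src[i] * k, written with non-temporal stores.
// Only one write-combining stream (dst) is active for the whole loop.
void scale_stream(float* dst, const float* src, std::size_t len, float k) {
    assert(len % 4 == 0);                                      // whole __m128 vectors only
    assert(reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);   // _mm_stream_ps needs 16B alignment
    assert(reinterpret_cast<std::uintptr_t>(src) % 16 == 0);   // _mm_load_ps needs 16B alignment
    const __m128 vk = _mm_set1_ps(k);
    for (std::size_t i = 0; i < len; i += 4) {
        __m128 v = _mm_mul_ps(_mm_load_ps(src + i), vk);
        _mm_stream_ps(dst + i, v);  // non-temporal store
    }
}

// Two outputs, two passes: the streams to out_a and out_b are never interleaved.
void scale_two_arrays(float* out_a, float* out_b,
                      const float* in_a, const float* in_b,
                      std::size_t len, float k) {
    scale_stream(out_a, in_a, len, k);  // pass 1: only out_a is streamed
    scale_stream(out_b, in_b, len, k);  // pass 2: only out_b is streamed
    _mm_sfence();  // order the streamed data before later stores seen by other threads
}
```

Whether two passes beat an interleaved loop depends on the working-set size, so the WcbFull PTI metric above is the thing to check before and after such a change.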
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). x-axis: binary; y-axis: Elapsed Time (s), 0 to 120. Before: 104, After: 47 (+123%).]
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, atof, srand, rand, EXIT_SUCCESS
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // layout: x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);  // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char *argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. x-axis: binary; y-axis: Time (s), 0 to 400. Before: 365, After: 47 (+679%).]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, srand, rand, EXIT_SUCCESS
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes: neighboring ThreadData share a cache line
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes: one ThreadData per cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void *param) {
        ThreadData *p = (ThreadData *)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char *argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE *threads = new HANDLE[numThreads];
        ThreadData *a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                (void *)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Profile table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
    mov qword ptr [rsp+28h],rsi
    lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov r9,rbp
    mov dword ptr [rsp+20h],esi
    xor edx,edx
    xor ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov qword ptr [r15+rdi*8],rax
    add rbp,40h
    inc rdi
    cmp rdi,r14
    jb 0000000140001180
BEFORE (BAD)
    0000000140001180:
    mov qword ptr [rsp+28h],rsi
    lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov r9,rbp
    mov dword ptr [rsp+20h],esi
    xor edx,edx
    xor ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov qword ptr [r12+rdi*8],rax
    add rbp,4
    inc rdi
    cmp rdi,r14
    jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floydd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
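The +679 label can be cross-checked against the ten runs listed above. The deck does not state the formula, so the following worked calculation assumes the label is the percentage change between the column means, taken in the table's raw units:

```latex
\[
\bar{t}_{\text{before}} = \frac{1}{10}\sum_{i=1}^{10} t_{\text{before},i} \approx 3653350,
\qquad
\bar{t}_{\text{after}} = \frac{1}{10}\sum_{i=1}^{10} t_{\text{after},i} \approx 468718
\]
\[
\text{difference} = \left(\frac{\bar{t}_{\text{before}}}{\bar{t}_{\text{after}}} - 1\right) \times 100\%
\approx (7.794 - 1) \times 100\% \approx 679\%
\]
```

Using the rounded chart values instead (365 and 47) gives about 677%, which suggests the label was computed from the unrounded means.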
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
DISCLAIMER
• World of Warcraft
  • https://community.amd.com/community/gaming/blog/2019/02/08/ryzen-processors-are-now-optimized-for-both-alliance-and-horde-in-world-of-warcraft
  • Testing done by AMD performance labs, January 4, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200, GeForce GTX 1080 (driver 398.82), MSI B450 GAMING PLUS Socket AM4 motherboard, Samsung 850 SSD, Windows 10 x64 Pro (RS4).
• DOTA 2
  • Testing done by Kenneth Mitchell, February 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, Radeon™ R7 SSD, 1920x1080 resolution, Best Looking. DOTA 2 version 3359, gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
• Audacity LAME MP3 Encode
  • https://www.pcper.com/reviews/Processors/Ryzen-7-2700X-and-Ryzen-5-2600X-Review-Zen-Matures/Media-Encoding-and-Rendering
  • AMD Ryzen™ 7 2700X Processor, 16GB Corsair Vengeance DDR4-3200 at 2933, NVIDIA GeForce GTX 1080Ti 11GB (driver 390.77), ASUS Crosshair VII Hero motherboard, Corsair Neutron XTi 480 SSD, Windows 10 Pro x64 RS3, fully updated as of 412017.
“ZEN” FAMILY PROCESSORS
DATA FLOW
“PINNACLE RIDGE” 8 CORE PROCESSOR
[Diagram: data flow between 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, and IO Hub Controller. Link labels: 32B fetch, 2*16B load, 1*16B store, 32B/cycle, 16B/cycle. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
• cclk: 4.3 GHz, 3.7 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 615 MHz
“RAVEN RIDGE” 4 CORE PROCESSOR
[Diagram: data flow between 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 4M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, and Media. Link labels: 32B fetch, 2*16B load, 1*16B store, 32B/cycle, 16B/cycle. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
• cclk: 3.9 GHz, 3.6 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 496 MHz
• sclk: 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Diagram: data flow between 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, and GMI. Link labels: 32B fetch, 2*16B load, 1*16B store, 32B/cycle, 16B/cycle, 16B/cycle off die (GMI). Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
• cclk: 4.4 GHz, 3.5 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR channels
• ~64ns near memory
• ~105ns far memory
• NUMA or UMA
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Diagram: Die 0 and Die 1, each with two CCXs, DDR, and IO providing two sets of 16 PCIe® Gen 3 lanes; DDR4 channels A, B, C, D; dies linked by ∞ (Infinity Fabric).]
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR channels
• ~64ns near memory
• ~105ns far memory
• NUMA only
• Windows 10 2019H1 Insider Preview has improved NUMA support
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Diagram: Die 0, Die 1, Die 2, and Die 3, each with two CCXs; two of the dies also carry DDR and IO with sets of 16 PCIe® Gen 3 lanes; DDR4 channels A, B, C, D; dies linked by ∞ (Infinity Fabric).]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC than previous generation AMD processors
• Testing by AMD Performance labs. PC manufacturers may vary configurations yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs “Piledriver” (31.5 vs 20.7 | +52%), “Zen” vs “Excavator” (31.5 vs 19.2 | +64%). Cinebench R15 1t scores: “Zen” vs “Piledriver” (139 vs 79, both at 3.4G | +76%), “Zen” vs “Excavator” (160 vs 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE4.1 | SSE4.2 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | “Llano” | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS, and RSQRTSS, which define relative error as
  • |Relative Error| <= 1.5 * 2^-12
[Block diagram, floating-point unit: Decode/Rename; 192 Entry Retire Queue; 4 fp micro-op dispatch, 8 micro-op retire; 36 Entry Scheduler; 160 entry Physical Register File; MUL0, ADD0, MUL1, ADD1; Load Convert Unit; Forwarding Muxes; 128 bit loads; Int to FP; FP to Int]
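Because the approximate reciprocal instructions only bound the relative error, results that must agree across vendors are better compared against a tolerance than bit for bit. The following is a minimal sketch of such a check for a reciprocal estimate; `kRcpMaxRelError` and `withinRcpTolerance` are hypothetical names chosen for illustration and do not come from the slides.

```cpp
#include <cmath>

// Bound quoted on the slide: |relative error| <= 1.5 * 2^-12.
constexpr double kRcpMaxRelError = 1.5 / 4096.0;

// Returns true if `approx` is an acceptable estimate of 1/x under the
// documented bound. Hypothetical helper; assumes x is finite and non-zero.
inline bool withinRcpTolerance(float x, float approx) {
    const double exact = 1.0 / static_cast<double>(x);
    const double relError =
        std::fabs((static_cast<double>(approx) - exact) / exact);
    return relError <= kRcpMaxRelError;
}
```

A test that compares captured output from two machines would apply a check like this to each result instead of requiring identical bits.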
CLZERO INSTRUCTION
• AMD vendor-specific instruction
  • An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Block diagram, load/store unit: Load Queue; Store Queue; L1/L2 TLB; Micro-tags; Page Walker; L0 Pick; L1 Pick; TLB0, DAT0, TLB1, DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; AGU0, AGU1; To Ex; To FP; WCB; To L2; MAB; Store Pipe Pick; STP; Store Commit]
CACHE LATENCY
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions

Level  Count  Capacity  Sets  Ways  Line Size  Latency
uop    8      2 K uops  32    8     8 uops     N/A
L1I    8      64 KB     256   4     64 B       4 clocks
L1D    8      32 KB     64    8     64 B       4 clocks
L2U    8      512 KB    1024  8     64 B       12 clocks
L3U    2      8 MB      8192  16    64 B       35 clocks

[Chart: Cache Latency in core clock cycles (less is better)]
                      L1D  L2U  L3U
AMD Ryzen™ 7 1800X    4    17   40
AMD Ryzen™ 7 2700X    4    12   35
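The capacity column follows from sets x ways x line size, which can be checked with a few lines of arithmetic. The sketch below also shows which set an address would fall into under the simplifying assumption of plain modulo indexing on the line address; real hardware may hash address bits, and `CacheGeometry`, `capacityBytes`, and `setIndex` are illustrative names, not part of the slides.

```cpp
#include <cstdint>

// Geometry of one cache level, taken from the table (illustrative struct).
struct CacheGeometry {
    std::uint64_t sets;
    std::uint64_t ways;
    std::uint64_t lineBytes;
};

// Capacity in bytes: sets * ways * line size.
constexpr std::uint64_t capacityBytes(const CacheGeometry& g) {
    return g.sets * g.ways * g.lineBytes;
}

// Set index assuming simple modulo indexing (an assumption of this sketch).
constexpr std::uint64_t setIndex(const CacheGeometry& g, std::uint64_t address) {
    return (address / g.lineBytes) % g.sets;
}

constexpr CacheGeometry kL1I{256, 4, 64};
constexpr CacheGeometry kL1D{64, 8, 64};
constexpr CacheGeometry kL2{1024, 8, 64};
constexpr CacheGeometry kL3{8192, 16, 64};

static_assert(capacityBytes(kL1I) == 64 * 1024, "L1I is 64 KB");
static_assert(capacityBytes(kL1D) == 32 * 1024, "L1D is 32 KB");
static_assert(capacityBytes(kL2) == 512 * 1024, "L2 is 512 KB");
static_assert(capacityBytes(kL3) == 8 * 1024 * 1024, "L3 is 8 MB");
```

Under the modulo assumption, L1D addresses 4 KB apart (64 sets x 64 B) compete for the same 8 ways, which is one reason large power-of-two strides can evict lines early.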
REFILL FROM LOCAL DRAM
• ~64ns near memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22

• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Block diagram labels: DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip)]
• The cost of a refill from the other CCX may be similar to memory latency
REFILL FROM OTHER LOCAL CCX
[Block diagram labels: DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
[Block diagram labels: DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip)]
• ~105ns far memory (see footnote RP2-22)
AMD μPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters that use the same event with different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable, Wolfenstein 2 SAS Machine Finale 1080p High novsync

Threads  Time (s)
0        23
1        29
2        6
3        3
4        4
5        4
6        4
7        3
8        2
9        2
10       1
11       1
12       1
13       1
14       1
15       5
16       11
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
[Diagram labels for counter access: rdpmc; SMN in/out]
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• The binary compiled for the x64 platform shows higher performance
  • The ReleaseSSE2 configuration uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file 448MB 444 minutes

[Chart: LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) by Configuration and Platform (less is better)]
Configuration, Platform  Elapsed Time (s)  Speedup vs Release x86
Release x86              52
ReleaseSSE2 x86          44                +18%
Release x64              43                +21%
ReleaseSSE2 x64          39                +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
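One way to follow this advice is to decode the CPUID feature bits once at startup and branch on the result before any optional instruction is executed. The sketch below only decodes register values that the caller has already read (for example with `__cpuidex` and `_xgetbv` on MSVC); `CpuidRegs` and the helper names are hypothetical, and only a few of the features from the slide are shown.

```cpp
#include <cstdint>

// Raw CPUID outputs gathered elsewhere (illustrative container).
struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

constexpr bool bitSet(std::uint32_t value, unsigned index) {
    return ((value >> index) & 1u) != 0;
}

// Standard leaf 1, ECX.
constexpr bool hasSsse3(const CpuidRegs& leaf1)   { return bitSet(leaf1.ecx, 9); }
constexpr bool hasOsxsave(const CpuidRegs& leaf1) { return bitSet(leaf1.ecx, 27); }
constexpr bool hasAvxBit(const CpuidRegs& leaf1)  { return bitSet(leaf1.ecx, 28); }

// Extended leaf 0x80000001, ECX (AMD-defined feature bits).
constexpr bool hasFma4(const CpuidRegs& ext1) { return bitSet(ext1.ecx, 16); }

// AVX is usable only if the CPU reports it, the OS enabled XSAVE, and
// XCR0 (read with XGETBV) shows XMM (bit 1) and YMM (bit 2) state enabled.
constexpr bool canUseAvx(const CpuidRegs& leaf1, std::uint64_t xcr0) {
    return hasAvxBit(leaf1) && hasOsxsave(leaf1) && ((xcr0 & 0x6u) == 0x6u);
}
```

A dispatcher would typically pick the SSSE3 code path only when `hasSsse3` is true and fall back to an SSE2 path otherwise, which avoids the illegal instruction fault shown in the debugger output on “Greyhound” and “Llano”.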
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            }
            else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                   // logical processors
    DWORD_PTR p = 0xffffffffffffffff; // process mask
    int b = 32;                       // bit index
    #if 0 // BAD
    int x = p;                        // signed narrowing
    int count = 0;
    // while (x != 0) {               // never zero
    while (x > 0) {                   // false
        count += (x & 1);
        x >>= 1;                      // fill bits with sign
    }
    printf("popcnt = %i\n", count);   // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b)); // undefined 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
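The same idea can be written without Windows types. The sketch below keeps the affinity mask in an unsigned 64-bit integer from end to end, so right shifts fill with zeros and the bit for processor 32 and above is built from a 64-bit one. `countProcessors` and `processorBit` are hypothetical helper names used only for illustration.

```cpp
#include <cstdint>

// Counts set bits in a 64-bit affinity mask without narrowing it.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;  // unsigned shift: vacated bits are filled with zero
    }
    return count;
}

// Mask for logical processor `index` (0..63), built from a 64-bit one so
// that indices of 32 and above do not overflow a 32-bit int.
inline std::uint64_t processorBit(unsigned index) {
    return std::uint64_t{1} << index;
}
```

On a 16-thread processor such as the Ryzen 7 2700X the difference is invisible, which is why bugs of this kind tend to surface only on systems with 32 or more logical processors.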
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000]; // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // The first call fails and reports the required length in len.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or an insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts, 0 to 16. Y axis: cpu time % difference from NumContexts 1, -50 to 400. Series: Draws1025, Draws10250]
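The recommendation NumContexts = min(cores-1, Draws/250) can be computed at startup once the core count and an estimate of the draw count are known. The sketch below is one reading of that formula using integer division; the clamp to at least one context is an added assumption so that very small scenes or single-core systems still get a context, and `recommendedNumContexts` is a hypothetical name.

```cpp
#include <algorithm>

// Suggested context count: min(cores - 1, draws / 250).
// The lower clamp to 1 is an assumption of this sketch, not from the slide.
inline int recommendedNumContexts(int cores, int draws) {
    const int byCores = cores - 1;    // leave one core for the main thread
    const int byDraws = draws / 250;  // roughly 250 draws per context
    return std::max(1, std::min(byCores, byDraws));
}
```

For the two series in the chart on an 8-core processor this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.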
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test with a plain read first; only then test-and-set.
            while ((LOCK_IS_TAKEN == *pl) ||
                (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
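For code that is not tied to the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant rather than the presenter's code: `SpinLock`, `tryLock`, and `cpuRelax` are hypothetical names, the unlock uses a release store where the slide uses `_InterlockedExchange`, and the pause falls back to a no-op on targets where `_mm_pause` is not available.

```cpp
#include <atomic>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
inline void cpuRelax() { _mm_pause(); }
#elif defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
inline void cpuRelax() { _mm_pause(); }
#else
inline void cpuRelax() {}  // no pause hint on this target
#endif

// One lock per 64-byte cache line, so neighbouring data is not shared.
struct alignas(64) SpinLock {
    std::atomic<bool> taken{false};

    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiters only read the line.
            while (taken.load(std::memory_order_relaxed)) {
                cpuRelax();
            }
            // Test-and-set: a single exchange once the lock looks free.
            if (!taken.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    // Non-blocking attempt; returns true if the lock was acquired.
    bool tryLock() {
        return !taken.load(std::memory_order_relaxed) &&
               !taken.exchange(true, std::memory_order_acquire);
    }

    void unlock() { taken.store(false, std::memory_order_release); }
};
```

The design goal matches the summary above: waiting threads spend their time in read-only loads with a pause hint, and an atomic read-modify-write is issued only when the lock appears free.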
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Use Best Practices With Spinlocks, Elapsed Time (s) by binary (less is better)]
binary  Time (s)
Before  304
After   27      (+1018%)
CODE SAMPLE
    #include <intrin.h>
    #include <stdio.h>
    #include <windows.h>
    #include <algorithm> // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    // Each thread does its matrix work while holding the global lock.
    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0
cmp dword ptr [?gLock@@3IA],1
je 00000001400010D9
mov eax,1
xchg eax,dword ptr [?gLock@@3IA]
cmp eax,1
jne 00000001400010DD
00000001400010D9
pause
jmp 00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
xor eax,eax
lock cmpxchg dword ptr [?gLock@@3IA],ecx
cmp eax,ecx
je 00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
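The length condition above can be written out as a small predicate, which may help when auditing call sites. This is an illustrative sketch only: in_affected_range and copy_fixed are hypothetical helper names, and the idea that a copy with a compile-time constant length falls outside the stated condition follows from the "length is unknown at compile time" clause rather than from any tested behavior.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: true when a runtime copy length falls in the range
// the slide describes (length > 32 && length <= 128 bytes).
constexpr bool in_affected_range(std::size_t length) {
    return length > 32 && length <= 128;
}

// Hypothetical helper: the length N is a compile-time constant, so this call
// does not match the "length is unknown at compile time" part of the condition.
template <std::size_t N>
void copy_fixed(void* dst, const void* src) {
    std::memcpy(dst, src, N);
}
```

For example, in_affected_range(48) is true while in_affected_range(32) and in_affected_range(129) are false, matching the boundaries given on the slide.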
PREFACE
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\" to the application folder
WORKAROUND
memcpy.asm line 456
XmmCopySmall:
    bt __favor, __FAVOR_SMSTRG    ; check if string copy should be used
    jc memcpy_repmovs
    movups xmm0, [rcx + rdx]      ; load deferred bytes
    add rcx, 16
    sub r8, 16
memset.asm line 93
    ; Check if strings should be used
    cmp r8, 128                   ; is this a small set, size <= 128?
    ja XmmSet                     ; if large set, use XMM set
    bt __favor, __FAVOR_SMSTRG    ; check if string set should be used
    jnc XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: call memcpy benchmark (less is better). Elapsed Time (s), axis 0 to 35, versus size, axis 0 to 160; series: before, after]
CODE SAMPLE
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // The slide's loop bound was i <= sizeof(a), which writes one byte past the end of a
    for (int i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
calls memcpy within executable with workaround
memcpy
mov r11,rcx
mov r10,rdx
cmp r8,10h
jbe 00000001400012F0
cmp r8,20h
jbe 00000001400012D0
sub rdx,rcx
jae 00000001400012A4
lea rax,[r8+r10]
cmp rcx,rax
jb 00000001400015D0
cmp r8,80h
jbe 0000000140001510
…
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140.dll memcpy
memcpy
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
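One way to keep a single non-temporal stream open at a time is to process each array in its own pass rather than interleaving streaming stores to two arrays in one loop. The sketch below illustrates that alternative; it is not the workaround used in this talk's sample (which replaces one of the streaming stores with a regular store), and the function names integrate_one_stream and step_two_passes are hypothetical. It assumes an x86 target with SSE, 16-byte aligned data, and n a multiple of 8.

```cpp
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: data holds groups of 8 floats (4 position, 4 velocity).
// Only this one array receives non-temporal stores inside the loop.
void integrate_one_stream(float* data, std::size_t n, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < n; i += 8) {
        __m128 p = _mm_load_ps(&data[i]);      // position
        __m128 v = _mm_load_ps(&data[i + 4]);  // velocity
        // Non-temporal store of the updated position.
        _mm_stream_ps(&data[i], _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    _mm_sfence();  // make the streamed stores globally visible before returning
}

// Hypothetical helper: two separate passes, so the two output streams
// are never interleaved within a single loop body.
void step_two_passes(float* a, float* b, std::size_t n, float dt) {
    integrate_one_stream(a, n, dt);
    integrate_one_stream(b, n, dt);
}
```

Whether two passes beat a single pass with one regular store depends on the data size and access pattern, so this is best treated as something to measure with the WcbFull PTI counter rather than a guaranteed win.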
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%)]
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
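The two recommendations above (prefer thread-local data, and align and pad per-thread parameters) can be combined in a short portable sketch using std::thread. The names PaddedCounter and count_per_thread are hypothetical, and the code assumes C++17 so that std::vector honors the 64-byte alignment of its elements.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread slot, padded and aligned to one 64-byte cache line.
struct alignas(64) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

// Hypothetical demo: each thread accumulates in a local variable and
// publishes its result once into its own padded slot.
std::vector<unsigned long> count_per_thread(unsigned num_threads, unsigned long iters) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&counters, t, iters] {
            unsigned long local = 0;          // thread-local accumulation
            for (unsigned long i = 0; i < iters; ++i) {
                local += 1;
            }
            counters[t].sum = local;          // single write to the shared array
        });
    }
    for (auto& th : threads) th.join();

    std::vector<unsigned long> result;
    for (const auto& c : counters) result.push_back(c.sum);
    return result;
}
```

With the padding in place, adjacent slots are 64 bytes apart, so no two threads write to the same cache line; accumulating locally additionally removes most of the shared writes altogether.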
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid False Sharing. Time (s) by binary: Before 365, After 47 (+679%)]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;
#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif
DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
PROFILING AFTER
Few refills from CCX
[Profile columns: CPU clocks; Ret inst; IPC; Data Cache Accesses PTI; Demand Data Cache Miss; Data Cache Miss; Data Cache Refills DRAM PTI; Data Cache Refills CCX PTI; Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi*8],rax
add rbp,40h
inc rdi
cmp rdi,r14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi*8],rax
add rbp,4
inc rdi
cmp rdi,r14
jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
DISCLAIMER AND ATTRIBUTION
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floydd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff | |||
gxMT Off | 42.68 | ||||
gxMT On | 58.19 | +36% | |||
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2 version 3359, gameplay 7.21B (higher is better) | fps | diff | |||
DX9 | 53.8500972007 | ||||
DX11 | 58.9473747244 | +9% | |||
VK Beta | 60.1682313467 | +12% | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359, gameplay 7.21B. Steam Beta Built Feb 15 2019, at 15:03:34 | |||||
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff | |||
v3.99 non-SSE2 | 170 | ||||
v3.100 SSE2 | 82 | +107% | |||
Ryzen 7 2700X |
SUCCESS STORIES
[Chart: DOTA 2 version 3359, gameplay 7.21B (higher is better). Average FPS for DX9, DX11 (+9%), VK Beta (+12%)]
[Chart: Audacity LAME MP3 Encode x86 (lower is better). Elapsed Time (s) for v3.99 non-SSE2 and v3.100 SSE2 (+107%)]
[Chart: World of Warcraft v8.1 DX12 (higher is better). Average FPS for gxMT Off and gxMT On (+36%)]
see disclaimer and testing details in next slide
DISCLAIMER
• World of Warcraft
  • https://community.amd.com/community/gaming/blog/2019/02/08/ryzen-processors-are-now-optimized-for-both-alliance-and-horde-in-world-of-warcraft
  • Testing done by AMD performance labs January 4, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200, GeForce GTX 1080 (driver 398.82), MSI B450 GAMING PLUS Socket AM4 motherboard, Samsung 850 SSD, Windows 10 x64 Pro (RS4)
• DOTA 2
  • Testing done by Kenneth Mitchell February 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, Radeon™ R7 SSD, 1920x1080 resolution, Best Looking, DOTA 2 version 3359, gameplay 7.21B, Steam Beta Built Feb 15 2019, at 15:03:34
• Audacity Lame MP3 Encode
  • https://www.pcper.com/reviews/Processors/Ryzen-7-2700X-and-Ryzen-5-2600X-Review-Zen-Matures/Media-Encoding-and-Rendering
  • AMD Ryzen™ 7 2700X Processor, 16GB Corsair Vengeance DDR4-3200 at 2933, NVIDIA GeForce GTX 1080Ti 11GB (driver 390.77), ASUS Crosshair VII Hero motherboard, Corsair Neutron XTi 480 SSD, Windows 10 Pro x64 RS3 fully updated as of 412017
"ZEN" FAMILY PROCESSORS
DATA FLOW
"PINNACLE RIDGE" 8 CORE PROCESSOR
[Data flow diagram; labels as extracted]
• Caches: 32K D-Cache, 8-way; 64K I-Cache, 4-way; 512K L2 I+D Cache, 8-way; 8M L3 I+D Cache, 16-way
• Core interfaces: 32B fetch; 2x16B load; 1x16B store
• Blocks: Data Fabric; Unified Memory Controller; DRAM Channel; IO Hub Controller
• Link widths shown: 32B/cycle; 16B/cycle
• Clock domains: cclk; l3clk; fclk; uclk; memclk; lclk
• cclk 4.3 GHz / 3.7 GHz; fclk = uclk = memclk 1.47 GHz (DDR4-2933); lclk 615 MHz
"RAVEN RIDGE" 4 CORE PROCESSOR
[Data flow diagram; labels as extracted]
• Caches: 32K D-Cache, 8-way; 64K I-Cache, 4-way; 512K L2 I+D Cache, 8-way; 4M L3 I+D Cache, 16-way
• Core interfaces: 32B fetch; 2x16B load; 1x16B store
• Blocks: Data Fabric; Unified Memory Controller; DRAM Channel; IO Hub Controller; GFX9; Media
• Link widths shown: 32B/cycle; 16B/cycle
• Clock domains: cclk; l3clk; fclk; uclk; memclk; lclk
• cclk 3.9 GHz / 3.6 GHz; fclk = uclk = memclk 1.47 GHz (DDR4-2933); lclk 496 MHz; sclk 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Data flow diagram; labels as extracted]
• Caches: 32K D-Cache, 8-way; 64K I-Cache, 4-way; 512K L2 I+D Cache, 8-way; 8M L3 I+D Cache, 16-way
• Core interfaces: 32B fetch; 2x16B load; 1x16B store
• Blocks: Data Fabric; Unified Memory Controller; DRAM Channel; IO Hub Controller; GMI
• Link widths shown: 32B/cycle; 16B/cycle; 16B/cycle off die
• Clock domains: cclk; l3clk; fclk; uclk; memclk; lclk
• cclk 4.4 GHz / 3.5 GHz; fclk = uclk = memclk 1.47 GHz (DDR4-2933); lclk 727 MHz

RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA or UMA
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by reducing effective memory latency: it restricts the system to die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
Diagram: Die 0 and Die 1 joined by Infinity Fabric (∞) links; each die has two CCXs, two of the four DDR4 channels (A to D) and an IO block with two sets of 16 PCIe® Gen 3 lanes.

RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA only
  • Windows 10 2019H1 Insider Preview has improved NUMA support
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by reducing effective memory latency: it restricts the system to die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
Diagram: Die 0 to Die 3, each with two CCXs, joined by Infinity Fabric (∞) links; four DDR4 channels (A to D), two IO blocks and four sets of 16 PCIe® Gen 3 lanes.

MICROARCHITECTURE

ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”) / 8GB DDR3-2133 (“Excavator”) / 8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs 19.2 | +64%). Cinebench R15 1t scores: “Zen” vs. “Piledriver” (139 vs 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs 97.5, both at 4.0G | +64%). RZN-11

INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor-specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.

Feature columns, left to right (1 = supported, 0 = not supported): ADX, CLFLUSHOPT, RDSEED, SHA, SMAP, XGETBV, XSAVEC, XSAVES, AVX2, BMI2, MOVBE, RDRND, SMEP, FSGSBASE, XSAVEOPT, BMI, FMA, F16C, AES, AVX, OSXSAVE, PCLMULQDQ, SSE4.1, SSE4.2, XSAVE, SSSE3, CLZERO, FMA4, TBM, XOP

YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | FEATURES
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2011 | 12h | | “Llano” | A8-3870 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS and RSQRTSS, which define relative error as
  • |Relative Error| <= 1.5 * 2^-12
Diagram labels (floating point unit): Decode/Rename; 192 Entry Retire Queue; 4 fp micro-op dispatch, 8 micro-op retire; 36 Entry Scheduler; 160 entry Physical Register File; MUL0, ADD0, MUL1, ADD1; Load Convert Unit; Forwarding Muxes; 128 bit loads; Int to FP; Fp to Int

CLZERO INSTRUCTION
• AMD vendor-specific instruction
  • An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue -> Store Commit -> Write Combining Buffer -> L2 -> Data Fabric -> Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
Diagram labels (load/store unit): Load Queue; Store Queue; L1/L2 TLB; Micro-tags; Page Walker; L0 Pick; L1 Pick; TLB0, DAT0; TLB1, DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; AGU0, AGU1; To Ex; To FP; WCB; To L2; MAB; Store Pipe Pick; STP; Store Commit

CACHE LATENCY
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions

Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks

Cache Latency in core clock cycles (less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35

REFILL FROM LOCAL DRAM
• ~64ns near memory (RP2-22)
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
Diagram labels: CCX, CCM, SDF Transport Layer, CS, UMC, DDR4, CAKE, IFIS or IFOP (off-chip)

REFILL FROM OTHER LOCAL CCX
• Refill from other CCX: cost may be similar to memory latency
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
Diagram labels: CCX, CCM, SDF Transport Layer, CS, UMC, DDR4, CAKE, IFIS or IFOP (off-chip)

REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (RP2-22)
Diagram labels (two dies, each with): CCX, CCM, SDF Transport Layer, CS, UMC, DDR4, CAKE, IFIS or IFOP (off-chip)

AMD μPROF PROFILER

V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs

THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable, Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.

Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11

ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI

PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L2
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access paths shown in the diagram: rdpmc, SMN in/out

POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current or thermal limits

OPTIMIZATIONS AND LESSONS LEARNED

AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing

USE GENERAL GUIDANCE

USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops. Support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”

COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes

LAME 3.100 WAV to MP3 Encode (less is better)
Configuration, Platform | Elapsed Time (s) | vs Release x86
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%

TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
  • The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
  • Windows 7 x86 does NOT require SSE2 & PrefetchW

0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
  • Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        } else {
            count = cores;
        }
    }
    return count;
}

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores as logical processors 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors 16
Number of active physical processors 1
Number of cores per processor 16
Number of threads per processor core 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor 8
Number of threads per processor core 2
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
• Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                        // logical processors
    DWORD_PTR p = 0xffffffffffffffff;      // process mask
    int b = 32;                            // bit index
    #if 0 // BAD
        int x = p;                         // signed narrowing
        int count = 0;
        // while (x != 0)                  // never zero
        while (x > 0)                      // false
        {
            count += (x & 1);
            x >>= 1;                       // fill bits with sign
        }
        printf("popcnt = %i\n", count);    // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0)
        {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
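The unsigned handling recommended on this slide can also be written portably. The sketch below is an illustration under stated assumptions: it uses `std::uint64_t` in place of `DWORD_PTR` (assuming a 64-bit affinity mask), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Count set bits in a 64-bit affinity mask without narrowing to a signed type.
int count_processors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;  // unsigned shift: vacated bits are filled with zero
    }
    return count;
}

// Build the mask bit for one logical processor; the shift is done in 64 bits.
std::uint64_t processor_bit(unsigned index) {
    assert(index < 64);
    return std::uint64_t{1} << index;
}
```

Because both the mask and the shifted constant are unsigned 64-bit values, the loop terminates for a full mask and bit 32 is representable.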
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
• Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];               // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // First call with a NULL buffer only queries the required length.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            // Second call fills the correctly sized buffer.
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // use the processor information here
            }
        }
        free(buffer);
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16); Y axis: cpu time difference from NumContexts 1; series: Draws1025, Draws10250.]
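The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a helper function. The function name is hypothetical, and the lower bound of 1 is an added assumption of this sketch (so that small scenes or single-core systems still get one context); the slide itself states only the min() expression.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper implementing NumContexts = min(cores - 1, Draws / 250).
// Clamping the result to at least 1 is an assumption of this sketch.
int recommended_num_contexts(int physical_cores, int draws) {
    int by_cores = physical_cores - 1;  // leave one core for the main thread
    int by_draws = draws / 250;         // integer division
    return std::max(1, std::min(by_cores, by_draws));
}
```

On an 8-core processor this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.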
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions
    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock xchg cmp
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test first with a plain read, then test-and-set with an exchange.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
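The same test and test-and-set pattern can be expressed with standard C++ atomics. This is a minimal sketch rather than the presenter's code: `SpinLock`, `SPIN_PAUSE`, and `run_spinlock_demo` are hypothetical names, the memory orderings are a choice of this sketch, and non-x86 targets fall back to `std::this_thread::yield()` instead of the pause instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()  // fallback on non-x86 targets
#endif

// Test and test-and-set spin lock placed on its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: a plain load while the lock is held by someone else.
            if (state.load(std::memory_order_relaxed) == 0 &&
                // Test-and-set: exchange only when the lock looked free.
                state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
            SPIN_PAUSE();
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

// Hypothetical demo: several threads increment a shared counter under the lock.
long run_spinlock_demo(int num_threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Calling `run_spinlock_demo(4, 10000)` returns 40000 when the lock provides mutual exclusion. Unlocking with a release store instead of an exchange is a design choice of this sketch, not something the slides prescribe.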
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) by binary: Before 304, After 27 (+1018%).]
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // Accumulate LEN independent 4x4 matrix products: a += b * c.
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
00000001400010D9
    pause
    jmp   00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder.

memcpy.asm line 456:
    XmmCopySmall:
        bt __favor, __FAVOR_SMSTRG      ; check if string copy should be used
        jc memcpy_repmovs
        movups xmm0, [rcx + rdx]        ; load deferred bytes
        add rcx, 16
        sub r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp r8, 128                     ; is this a small set, size <= 128?
        ja XmmSet                       ; if large set, use XMM set
        bt __favor, __FAVOR_SMSTRG      ; check if string set should be used
        jnc XmmSetSmall                 ; otherwise, use a 16-byte block set
        jmp memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160); Y axis: Elapsed Time (s); series: before, after.]
CODE SAMPLE
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand, EXIT_SUCCESS
    #include <cstring>   // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);   // size is unknown at compile time
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Use < rather than <= so the loop does not write one byte past a.
        for (int i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
memcpy
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
memcpy
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
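To make the PTI thresholds concrete, the sketch below computes a per-thousand-instructions rate from raw counts and classifies it using the two thresholds as printed on the slide. The profiler reports PTI directly; the function names, the enum, and the idea of feeding in raw event and instruction counts are assumptions of this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions (PTI).
double per_thousand_instructions(std::uint64_t events,
                                 std::uint64_t retired_instructions) {
    assert(retired_instructions > 0);
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}

enum class WcbFullRating { Ok, Bad, VeryBad };

// Thresholds from the summary: >= 22 PTI is bad, >= 35 PTI is very bad.
WcbFullRating rate_wcb_full(double wcb_full_pti) {
    if (wcb_full_pti >= 35.0) return WcbFullRating::VeryBad;
    if (wcb_full_pti >= 22.0) return WcbFullRating::Bad;
    return WcbFullRating::Ok;
}
```

A function showing about 23 WcbFull events per 1000 instructions rates as `Bad`, while about 3 per 1000 instructions rates as `Ok`.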
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%).]
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, atof, rand, srand
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);   // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);   // second non-temporal stream
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);    // regular store
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
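The align-and-pad advice can be sketched portably with `std::thread`. This is an illustration under stated assumptions, not the talk's sample: `PaddedCounter`, `g_counters`, `kMaxThreads`, and `run_padded_counters` are hypothetical names, and 64 bytes is assumed as the cache line size, matching the 64B lines mentioned elsewhere in the talk.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// One counter per thread, padded and aligned to its own 64-byte cache line.
struct alignas(64) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

constexpr int kMaxThreads = 8;          // hypothetical fixed upper bound
PaddedCounter g_counters[kMaxThreads];  // statically allocated, 64-byte aligned

// Each thread writes only its own slot, so no cache line is shared for writes.
void run_padded_counters(int num_threads, unsigned long iterations) {
    assert(num_threads >= 1 && num_threads <= kMaxThreads);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([t, iterations] {
            PaddedCounter& mine = g_counters[t];
            mine.sum = 0;
            for (unsigned long i = 0; i < iterations; ++i) mine.sum += i % 2;
        });
    }
    for (auto& th : threads) th.join();
}
```

Removing `alignas(64)` would pack several counters into one cache line, which is the layout the slide warns against. A static array is used here so the alignment does not depend on the allocator.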
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid False Sharing. Time (s) by binary: Before 365, After 47 (+679%).]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, rand, srand
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;   // every thread writes only its own sum
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
BEFORE (BAD)
0000000140001180
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
• World of Warcraft
• https://community.amd.com/community/gaming/blog/2019/02/08/ryzen-processors-are-now-optimized-for-both-alliance-and-horde-in-world-of-warcraft
• Testing done by AMD performance labs January 4, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200, GeForce GTX 1080 (driver 39882), MSI B450 GAMING PLUS Socket AM4 motherboard, Samsung 850 SSD, Windows 10 x64 Pro (RS4)
• DOTA 2
• Testing done by Kenneth Mitchell February 18, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 1911 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, Radeon™ R7 SSD, 1920x1080 resolution, Best Looking, DOTA 2 version 3359, gameplay 721B, Steam Beta Built Feb 15 2019 at 150334
• Audacity LAME MP3 Encode
• https://www.pcper.com/reviews/Processors/Ryzen-7-2700X-and-Ryzen-5-2600X-Review-Zen-Matures/Media-Encoding-and-Rendering
• AMD Ryzen™ 7 2700X Processor, 16GB Corsair Vengeance DDR4-3200 at 2933, NVIDIA GeForce GTX 1080Ti 11GB (driver 39077), ASUS Crosshair VII Hero motherboard, Corsair Neutron XTi 480 SSD, Windows 10 Pro x64 RS3 fully updated as of 412017
DISCLAIMER
“ZEN” FAMILY PROCESSORS
DATA FLOW
“PINNACLE RIDGE” 8 CORE PROCESSOR
[Data flow diagram. Blocks: 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller. Link labels: 32B fetch, 2×16B load, 1×16B store, 32B/cycle, 16B/cycle. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
cclk: 4.3 GHz, 3.7 GHz
fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
lclk: 615 MHz
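The width and clock labels on this slide combine into peak bandwidth as bytes per cycle times clock rate. The sketch below is a minimal illustration, assuming the 16B/cycle label belongs to the DRAM channel at memclk and that a 32B/cycle link runs at the same 1.47 GHz fabric clock; the mapping of labels to links and all function names are assumptions, not taken from the slides.

```cpp
#include <cstdio>

// Peak bandwidth in GB/s for a link that moves `bytes_per_cycle`
// on every cycle of a clock running at `clock_ghz`.
constexpr double peak_gb_per_s(double bytes_per_cycle, double clock_ghz) {
    return bytes_per_cycle * clock_ghz;
}

// DDR4-2933 transfers twice per memory clock, so memclk is about 1.47 GHz.
constexpr double kMemclkGhz = 2.933 / 2.0;

// Assumed mapping of slide labels to links (hypothetical helper names).
constexpr double dram_channel_gb_per_s() { return peak_gb_per_s(16.0, kMemclkGhz); }
constexpr double fabric_link_gb_per_s()  { return peak_gb_per_s(32.0, kMemclkGhz); }

// Prints both figures; call from any program that includes this snippet.
void print_peak_bandwidths() {
    std::printf("DRAM channel (16B/cycle at memclk): %.1f GB/s\n", dram_channel_gb_per_s());
    std::printf("Fabric link (32B/cycle at fclk):    %.1f GB/s\n", fabric_link_gb_per_s());
}
```

These are theoretical peaks only; measured bandwidth depends on access pattern, memory timings and contention.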
“RAVEN RIDGE” 4 CORE PROCESSOR
[Data flow diagram. Blocks: 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 4M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, Media. Link labels: 32B fetch, 2×16B load, 1×16B store, 32B/cycle, 16B/cycle. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
cclk: 3.9 GHz, 3.6 GHz
fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
lclk: 496 MHz
sclk: 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Data flow diagram. Blocks: 32K D-Cache (8-way), 64K I-Cache (4-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way), Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GMI. Link labels: 32B fetch, 2×16B load, 1×16B store, 32B/cycle, 16B/cycle, 16B/cycle off die. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
cclk: 4.4 GHz, 3.5 GHz
fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
lclk: 727 MHz
• 16 cores, 32 threads
• 4 DDR Channels
• ~64ns near memory
• ~105ns far memory
• NUMA or UMA
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50 GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 39836), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Package diagram: Die 0 and Die 1 with four CCXs in total, DDR and IO blocks, DDR4 Channels A to D, four groups of 16 PCIe® Gen 3 lanes, and Infinity Fabric (∞) links.]
• 32 cores, 64 threads
• 4 DDR Channels
• ~64ns near memory
• ~105ns far memory
• NUMA only
• Windows 10 2019H1 Insider Preview has improved NUMA support
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25 GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 39836), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
[Package diagram: Die 0, Die 1, Die 2 and Die 3 with eight CCXs in total, DDR and IO blocks, DDR4 Channels A to D, groups of 16 PCIe® Gen 3 lanes, and Infinity Fabric (∞) links between the dies.]
MICROARCHITECTURE
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
• Testing by AMD Performance labs. PC manufacturers may vary configurations yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs. 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs. 19.2 | +64%). Cinebench R15 1T scores: “Zen” vs. “Piledriver” (139 vs. 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs. 97.5, both at 4.0G | +64%). RZN-11
ZEN SMT DESIGN OVERVIEW
Family 17h added the AMD vendor-specific instruction CLZERO and deprecated support for FMA4, TBM & XOP
INSTRUCTION SET EVOLUTION
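YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450
2011 | 14h | 00h-0Fh | “Ontario” | E-450
2011 | 12h | | “Llano” | A8-3870
2009 | 10h | | “Greyhound” | Phenom II X4 955

Instruction support by example product (1 = supported, 0 = not supported):
INSTRUCTION | 2700X | A12-9800 | A10-7890K | FX-8370 | FX-8150 | A6-1450 | E-450 | A8-3870 | X4 955
ADX | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
CLFLUSHOPT | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
RDSEED | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SHA | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SMAP | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
XGETBV | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
XSAVEC | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
XSAVES | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
AVX2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
BMI2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
MOVBE | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0
RDRND | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
SMEP | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FSGSBASE | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0
XSAVEOPT | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0
BMI | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0
FMA | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
F16C | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0
AES | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
AVX | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
OSXSAVE | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
PCLMULQDQ | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
SSE4.1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
SSE4.2 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
XSAVE | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
SSSE3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0
CLZERO | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FMA4 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
TBM | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
XOP | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0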
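Because support for FMA4, TBM and XOP differs between families, code that uses these extensions would typically check CPUID at run time rather than assume them from the vendor string. The sketch below is one possible way to do that with GCC or Clang on x86; the struct and function names are illustrative and not part of any AMD API, and MSVC builds would need `__cpuidex` from `<intrin.h>` instead of `<cpuid.h>`.

```cpp
#include <cstdint>

// Raw register values from the CPUID leaves that carry the bits of interest.
struct CpuidRegs {
    std::uint32_t leaf7_ebx;  // CPUID Fn0000_0007 (subleaf 0), EBX
    std::uint32_t ext1_ecx;   // CPUID Fn8000_0001, ECX
    std::uint32_t ext8_ebx;   // CPUID Fn8000_0008, EBX
};

struct IsaSupport {
    bool avx2, fma4, tbm, xop, clzero;
};

constexpr bool bit(std::uint32_t value, unsigned n) { return ((value >> n) & 1u) != 0; }

// Decode the feature bits from raw register values.
constexpr IsaSupport decode(const CpuidRegs& r) {
    return IsaSupport{
        bit(r.leaf7_ebx, 5),   // AVX2
        bit(r.ext1_ecx, 16),   // FMA4
        bit(r.ext1_ecx, 21),   // TBM
        bit(r.ext1_ecx, 11),   // XOP
        bit(r.ext8_ebx, 0)     // CLZERO
    };
}

#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
// Read the registers on the running CPU; unsupported leaves are left as zero.
inline CpuidRegs query_cpuid() {
    unsigned a = 0, b = 0, c = 0, d = 0;
    CpuidRegs r{};
    if (__get_cpuid_count(7, 0, &a, &b, &c, &d)) r.leaf7_ebx = b;
    if (__get_cpuid(0x80000001u, &a, &b, &c, &d)) r.ext1_ecx = c;
    if (__get_cpuid(0x80000008u, &a, &b, &c, &d)) r.ext8_ebx = b;
    return r;
}
#endif
```

A caller would write `decode(query_cpuid()).fma4` before selecting an FMA4 code path. For AVX-class extensions the operating system must also enable the extended register state, which this sketch does not check.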
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
• Avoid mixing legacy SIMD and AVX instructions
• Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
• Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS, RSQRTSS, which define relative error as
• |Relative Error| <= 1.5 × 2^-12
FLOAT INSTRUCTIONS
[Floating-point unit diagram: Decode/Rename, 192 Entry Retire Queue, 36 Entry Scheduler, 160 entry Physical Register File, Forwarding Muxes, pipes MUL0, ADD0, MUL1, ADD1, Load Convert Unit, Int to FP, FP to Int. Labels: 4 fp micro-op dispatch, 8 micro-op retire, 128 bit loads.]
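Since the approximate reciprocal and reciprocal square root instructions only guarantee a relative error bound, comparisons across processors are safer with a tolerance than with bit-exact equality. The sketch below is a portable scalar illustration of that bound and of one Newton-Raphson refinement step; the function names are hypothetical, and it does not call the hardware instructions themselves.

```cpp
#include <cmath>

// Documented bound for RCPPS/RCPSS/RSQRTPS/RSQRTSS: |relative error| <= 1.5 * 2^-12.
constexpr double kApproxMaxRelError = 1.5 / 4096.0;

// Relative error of an approximation against a reference value (exact != 0).
inline double relative_error(double approx, double exact) {
    return std::fabs((approx - exact) / exact);
}

// True when an approximate result is inside the documented bound.
inline bool within_documented_bound(double approx, double exact) {
    return relative_error(approx, exact) <= kApproxMaxRelError;
}

// One Newton-Raphson step for 1/a starting from estimate x0.
// If x0 = (1/a)(1 + e), the result is (1/a)(1 - e*e), so the error is roughly squared.
inline float refine_reciprocal(float a, float x0) {
    return x0 * (2.0f - a * x0);
}
```

Applying `refine_reciprocal` to an estimate that is within the bound brings results from different processors much closer together, although they may still differ in the last bits.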
• AMD vendor-specific instruction
• An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
• Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
CLZERO INSTRUCTION
[Load/store unit diagram: Load Queue, Store Queue, L1/L2 TLB, Micro-tags, Page Walker, L0 Pick, L1 Pick, TLB0, DAT0, TLB1, DAT1, 32K Data Cache, Prefetch, AGU0, AGU1, WCB, MAB, Store Pipe Pick, STP, Store Commit. Labels: 32 bytes to/from L2, To Ex, To FP, To L2.]
• Cache line size is 64 Bytes
• 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
• Lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
• And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
CACHE LATENCY
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks

Cache Latency in core clock cycles (less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
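The capacity column of the cache table follows from sets × ways × line size, and the two-cycle line transfer follows from a 64-byte line moving over a 32B/cycle link. The compile-time check below is a small sketch of that arithmetic; the type and constant names are illustrative.

```cpp
#include <cstddef>

// Geometry of one cache level as listed in the table.
struct CacheLevel {
    std::size_t sets;
    std::size_t ways;
    std::size_t line_bytes;
};

constexpr std::size_t capacity_bytes(CacheLevel c) {
    return c.sets * c.ways * c.line_bytes;
}

// Cycles needed to move one line over a link of the given width (rounded up).
constexpr std::size_t cycles_per_line(std::size_t line_bytes, std::size_t bytes_per_cycle) {
    return (line_bytes + bytes_per_cycle - 1) / bytes_per_cycle;
}

constexpr CacheLevel kL1I{256, 4, 64};
constexpr CacheLevel kL1D{64, 8, 64};
constexpr CacheLevel kL2{1024, 8, 64};
constexpr CacheLevel kL3{8192, 16, 64};

static_assert(capacity_bytes(kL1I) == 64u * 1024u, "L1I: 64 KB");
static_assert(capacity_bytes(kL1D) == 32u * 1024u, "L1D: 32 KB");
static_assert(capacity_bytes(kL2) == 512u * 1024u, "L2: 512 KB");
static_assert(capacity_bytes(kL3) == 8u * 1024u * 1024u, "L3: 8 MB");
static_assert(cycles_per_line(64, 32) == 2, "64 B line over 32B/cycle takes 2 cycles");
```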
• ~64ns near memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 39836), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM LOCAL DRAM
[Diagram blocks: 2× DDR4, 2× CCX, 2× CCM, 2× CS, 2× UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip).]
• Refill from other CCX cost may be similar to memory latency
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM OTHER LOCAL CCX
[Diagram blocks: 2× DDR4, 2× CCX, 2× CCM, 2× CS, 2× UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip).]
REFILL FROM REMOTE DIE DRAM
[Diagram blocks, shown once for the local die and once for the remote die: 2× DDR4, 2× CCX, 2× CCM, 2× CS, 2× UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip).]
• ~105ns far memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 39836), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
AMD μPROF PROFILER
• Remote profiling
• Thread Concurrency
• Support for multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable, Wolfenstein 2 SAS Machine Finale 1080p High novsync

Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Embedded Excel chart data omitted. Chart titles: World of Warcraft (Average FPS, gxMT Off vs. gxMT On); Audacity LAME MP3 Encode x86 (Elapsed Time (s), lower is better); DOTA 2 (Average FPS, higher is better); Cache Latency (core clock cycles, less is better); Use Best Practices With Spinlocks (Time (s), less is better); Avoid Too Many Non-Temporal Streams (Time (s), less is better); Avoid False Sharing (Time (s), less is better); LAME 3100 WAV to MP3 Encode (seconds by Configuration and Platform, less is better); call memcpy benchmark (Elapsed Time (s) by size, and B difference from A by size, less is better); D3D12Multithreading TitleThrottle=10000 (cpu time difference from NumContexts1 by NumContexts, less is better).]
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
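Several of the metrics listed for this profile are ratios of raw counter values. The sketch below shows how IPC and a per-thousand-instructions figure could be derived from sampled counts, assuming PTI stands for "per thousand retired instructions"; the function names are illustrative and are not part of the profiler.

```cpp
#include <cstdint>

// Instructions per cycle from two raw counts.
inline double ipc(std::uint64_t retired_instructions, std::uint64_t cpu_clocks) {
    return cpu_clocks == 0 ? 0.0
                           : static_cast<double>(retired_instructions) /
                                 static_cast<double>(cpu_clocks);
}

// Normalize any event count to events per thousand retired instructions (PTI).
inline double per_thousand_instructions(std::uint64_t event_count,
                                        std::uint64_t retired_instructions) {
    return retired_instructions == 0 ? 0.0
                                     : 1000.0 * static_cast<double>(event_count) /
                                           static_cast<double>(retired_instructions);
}
```

Normalizing by retired instructions makes functions of different lengths comparable: for example, 50 DRAM refills over 100,000 retired instructions is 0.5 PTI.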
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access paths shown: rdpmc, SMN in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
• “Zen” Common Options
• Core Performance Boost = Disable
• Global C-state Control = Disable
• OS Power Options: Choose Power Plan
• High Performance = selected
• AMD Ryzen™ Master Utility
• Control Mode = Manual
• All Cores = Enabled
• Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if application exceeds power, current, thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year Visual Studio Changes AMD Product Family (implicit)2019 Additional SIMD instrinsics optimizations including constant-folding and arithmetic simplifications
Build throughput improvements New -Ob3 inlining option Memcpy amp Memset optimizationsldquoPinnacle Ridgerdquo
2017 Improved code generation of loops Support for automatic vectorization of division of constant integers better identification of memset patterns Added Cmake support Added faster database engine Improved STL amp NET optimizations New Qspectre option
ldquoSummit Ridgerdquo
2015 Improved autovectorization amp scalar optimizations Faster build times with LTCGincremental Added assembly optimized memset amp memcpy using ERMS amp SSE2
ldquoKaverirdquo ldquoGodavarirdquo
2013 Improved inline Improved auto-vectorization Improved ISO C99 language and library ldquoVisherardquo
2012 Added autovectorization Optimized container memory sizes ldquoBulldozerrdquo
2010 Added nullptr keyword Replaced VCBuild with MSBuild
2008 Tuned for Intel Core microarchitecture Improved cpuidex amp intrinsics Added Qfast_transcendentals amp STLCLR library Faster build times with MP amp Managed incremental builds
ldquoGreyhoundrdquo
2005 Added x64 native compiler ldquoK8rdquo
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201934
COMPILE FOR PLATFORM X64
• Binary compiled after applying Platform x64 shows higher performance.
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes
LAME 3.100 WAV to MP3 Encode, elapsed time (s), less is better
Configuration, Platform | Release, x86 | ReleaseSSE2, x86 | Release, x64 | ReleaseSSE2, x64
seconds | 52 | 44 | 43 | 39
vs. Release x86 | | +18% | +21% | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.
Debugger output when an unsupported instruction is executed:
    0:000> g
    (79c838): Illegal instruction - code c000001d (first chance)
    (79c838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
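The rule above can be sketched as a small feature check done once at startup. This is an illustrative sketch, not code from the talk: `CpuidRegs`, `query_cpuid`, and the `has_*` helpers are hypothetical names. The bit positions used are the documented ones (SSSE3 is CPUID Fn0000_0001 ECX bit 9, AVX512F is CPUID Fn0000_0007 subleaf 0 EBX bit 16, FMA4 is CPUID Fn8000_0001 ECX bit 16). A complete AVX or AVX-512 check would also confirm operating system support for the extended register state via XGETBV, which is omitted here.

```cpp
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

struct CpuidRegs { unsigned int eax, ebx, ecx, edx; };

// Executes CPUID for the given leaf and subleaf. Returns all zeros when the
// leaf is not supported or when the build target is not x86.
CpuidRegs query_cpuid(unsigned int leaf, unsigned int subleaf) {
    CpuidRegs r = {0, 0, 0, 0};
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, static_cast<int>(leaf & 0x80000000u));  // highest leaf in range
    if (static_cast<unsigned int>(regs[0]) >= leaf) {
        __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
        r.eax = static_cast<unsigned int>(regs[0]);
        r.ebx = static_cast<unsigned int>(regs[1]);
        r.ecx = static_cast<unsigned int>(regs[2]);
        r.edx = static_cast<unsigned int>(regs[3]);
    }
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    // Leaves r untouched (all zeros) when the leaf is unsupported.
    __get_cpuid_count(leaf, subleaf, &r.eax, &r.ebx, &r.ecx, &r.edx);
#else
    (void)leaf;
    (void)subleaf;
#endif
    return r;
}

// Feature bits decoded from the register values of the relevant leaf.
bool has_ssse3(const CpuidRegs& leaf1)      { return ((leaf1.ecx >> 9) & 1u) != 0; }
bool has_avx512f(const CpuidRegs& leaf7)    { return ((leaf7.ebx >> 16) & 1u) != 0; }
bool has_fma4(const CpuidRegs& ext_leaf1)   { return ((ext_leaf1.ecx >> 16) & 1u) != 0; }
```

A caller would select a code path from the result, for example using a `pabsd` based routine only when `has_ssse3(query_cpuid(1, 0))` is true and falling back to an SSE2 routine otherwise.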
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
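The slide code relies on helpers that are not shown. As a hedged, platform-neutral sketch of the same policy, the functions below decode the CPU family from the CPUID Fn0000_0001 EAX value (base family in bits 11:8, plus extended family in bits 27:20 when the base family is 0xF) and then pick a default thread count. `cpuid_family` and `default_thread_count` are hypothetical names, and the EAX signatures in the usage note are illustrative values.

```cpp
#include <cstdint>
#include <cstring>

// Family number from CPUID Fn0000_0001 EAX.
unsigned cpuid_family(std::uint32_t eax) {
    const unsigned base = (eax >> 8) & 0xFu;    // BaseFamily, bits 11:8
    const unsigned ext  = (eax >> 20) & 0xFFu;  // ExtFamily, bits 27:20
    return (base == 0xFu) ? base + ext : base;
}

// Same policy as the slide: on AMD processors other than family 15h
// ("Bulldozer"), default to physical cores; otherwise use logical processors.
unsigned default_thread_count(const char* vendor, unsigned family,
                              unsigned cores, unsigned logical) {
    if (std::strcmp(vendor, "AuthenticAMD") == 0 && family != 0x15u) {
        return cores;
    }
    return logical;
}
```

For example, an EAX value of 0x00800F82 decodes to family 17h, so an 8-core, 16-thread part would default to 8 threads under this policy. Profiling should still decide the final count.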
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • "Number of cores per processor" & "Number of threads per processor" return incorrect values.
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index
    #if 0 // BAD
        int x = p;                      // signed narrowing
        int count = 0;
        // while (x != 0) {             // never zero
        while (x > 0) {                 // false
            count += (x & 1);
            x >>= 1;                    // fill bits with sign
        }
        printf("popcnt = %i\n", count);                      // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
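The same point can be shown without Windows types. The sketch below, with hypothetical helper names, keeps the mask in an unsigned 64-bit integer so that right shifts fill with zeros and the loop terminates, and builds single-bit masks from a 64-bit one so that indices of 32 and above are well defined.

```cpp
#include <cstdint>

// Counts the set bits of a 64-bit affinity mask. The parameter is unsigned,
// so each right shift brings in a zero and the loop always ends.
int count_set_bits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Mask with only bit `index` set, for index in [0, 63]. Shifting a 64-bit
// one avoids the undefined result of (1 << 32) on a 32-bit int.
std::uint64_t bit_mask(unsigned index) {
    return std::uint64_t{1} << index;
}
```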
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];   // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // The first call is expected to fail and report the required length.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: D3D12Multithreading, TitleThrottle=10000. x-axis: NumContexts (0 to 16); y-axis: cpu time difference from NumContexts 1 (higher is better); series: Draws 1025, Draws 10250.]
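The recommendation NumContexts = min(cores-1, Draws/250) is easy to get wrong at the edges, so a small helper may be useful. This is a sketch under stated assumptions: the function name is hypothetical, integer division is assumed for Draws/250, and the clamp to a minimum of one context is an addition that the slide does not mention.

```cpp
#include <algorithm>

// NumContexts = min(cores - 1, draws / 250), clamped to at least 1.
// The lower clamp is an assumption so that small scenes and single-core
// systems still get one context.
unsigned recommended_num_contexts(unsigned physical_cores, unsigned draws) {
    const unsigned by_cores = (physical_cores > 1) ? physical_cores - 1 : 1;
    const unsigned by_draws = draws / 250;  // integer division
    return std::max(1u, std::min(by_cores, by_draws));
}
```

With 8 physical cores this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.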
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithreadenable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions
    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {  // lock xchg cmp
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test with a plain read first; only then attempt the exchange.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
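For readers not using the MSVC interlocked intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant, not the code from the talk: `SpinLock` and `SPIN_PAUSE` are hypothetical names, and on non-x86 targets the sketch falls back to `std::this_thread::yield()` in place of the pause instruction. Unlike the slide's `Unlock`, which uses an exchange, this version releases with a plain store.

```cpp
#include <atomic>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// Padded to a full 64-byte cache line so the lock does not share a line
// with neighboring data.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    bool try_lock() {
        // Test: a relaxed read that does not take the line exclusively.
        // Test-and-set: exchange only when the lock looked free.
        return state.load(std::memory_order_relaxed) == 0 &&
               state.exchange(1, std::memory_order_acquire) == 0;
    }

    void lock() {
        while (!try_lock()) {
            SPIN_PAUSE();  // back off between attempts
        }
    }

    void unlock() {
        state.store(0, std::memory_order_release);
    }
};

static_assert(sizeof(SpinLock) == 64, "SpinLock should occupy one cache line");
```

The members match the standard Lockable shape, so the type can be used with `std::lock_guard<SpinLock>`.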
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Use Best Practices With Spinlocks, elapsed time (s), less is better
binary | Before | After
Time (s) | 304 | 27
Improvement | | +1018%
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // Accumulate LEN independent 4x4 matrix products.
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
[Screenshot: profiler source view, before]
SOURCE AFTER
[Screenshot: profiler source view, after]
DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0:
        cmp dword ptr [?gLock@@3IA],1
        je 00000001400010D9
        mov eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp eax,1
        jne 00000001400010DD
    00000001400010D9:
        pause
        jmp 00000001400010C0
    00000001400010DD:
BEFORE (BAD)
    00000001400010C1:
        xor eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp eax,ecx
        je 00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
• Preferred: instructions without lock prefix.
• Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently, “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder.
memcpy.asm line 456:
    XmmCopySmall:
        bt __favor, __FAVOR_SMSTRG    ; check if string copy should be used
        jc memcpy_repmovs
        movups xmm0, [rcx + rdx]      ; load deferred bytes
        add rcx, 16
        sub r8, 16
memset.asm line 93:
        ; Check if strings should be used
        cmp r8, 128                   ; is this a small set, size <= 128?
        ja XmmSet                     ; if large set, use XMM set
        bt __favor, __FAVOR_SMSTRG    ; check if string set should be used
        jnc XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark. x-axis: size (0 to 160); y-axis: elapsed time (s), less is better; series: before, after.]
CODE SAMPLE
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <numeric>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);   // size is only known at run time
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        for (size_t i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
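A portable variant of this measurement may help when checking whether a given build is affected. The sketch below is an assumption-laden illustration rather than the benchmark used for the chart: `time_memcpy_seconds` and `in_reported_range` are hypothetical names, the buffers are only 256 bytes, and the length is read through a `volatile` so the compiler cannot treat it as a compile-time constant. An optimizer may still transform the loop, so results should be confirmed in a profiler.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

alignas(64) static char src_buf[256];
alignas(64) static char dst_buf[256];

// True for the lengths the deck reports as affected: 32 < length <= 128.
bool in_reported_range(std::size_t length) {
    return length > 32 && length <= 128;
}

// Times `steps` memcpy calls of `size` bytes and returns elapsed seconds.
double time_memcpy_seconds(std::size_t size, std::size_t steps) {
    if (size > sizeof(src_buf)) size = sizeof(src_buf);  // stay in bounds
    volatile std::size_t runtime_size = size;  // re-read on every iteration
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) {
        std::memcpy(dst_buf, src_buf, runtime_size);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Sweeping `size` from 0 to 160 with a fixed step count gives the same kind of curve as the chart, with any regression expected between 33 and 128 bytes.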
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb 00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
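As a hedged illustration of the single-stream advice, the sketch below writes one destination with non-temporal stores and covers each 64-byte line completely (four 16-byte stores) before moving to the next. `fill_stream` and `HAVE_SSE_STREAM` are hypothetical names; the function assumes a 64-byte aligned destination and a count that is a multiple of 16 floats, and it falls back to ordinary stores on targets without SSE.

```cpp
#include <cstddef>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_STREAM 1
#else
#define HAVE_SSE_STREAM 0
#endif

// Fills `count` floats at `dst` using a single non-temporal stream.
// Assumes dst is 64-byte aligned and count is a multiple of 16.
void fill_stream(float* dst, std::size_t count, float value) {
#if HAVE_SSE_STREAM
    const __m128 v = _mm_set1_ps(value);
    for (std::size_t i = 0; i < count; i += 16) {
        // Four 16-byte stores complete one 64-byte line before the next
        // line is started.
        _mm_stream_ps(dst + i + 0, v);
        _mm_stream_ps(dst + i + 4, v);
        _mm_stream_ps(dst + i + 8, v);
        _mm_stream_ps(dst + i + 12, v);
    }
    _mm_sfence();  // order the streamed stores before later stores
#else
    for (std::size_t i = 0; i < count; ++i) dst[i] = value;
#endif
}
```

When a loop must update a second array as well, keeping that second array on regular stores, as the workaround in this section does, leaves one stream per hardware thread.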
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Avoid Too Many Non-Temporal Streams, elapsed time (s), less is better
binary | Before | After
Time (s) | 104 | 47
Improvement | | +123%
CODE SAMPLE
    #include <intrin.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <numeric>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // layout per element: x y z w, vx vy vz vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
[Screenshot: profiler source view, before]
SOURCE AFTER
[Screenshot: profiler source view, after]
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
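The "align and pad" advice can be checked at compile time. The sketch below uses hypothetical type and helper names and assumes a 64-byte cache line; it contrasts a tightly packed per-thread slot with one padded to a full line and verifies that neighboring padded slots land on different lines.

```cpp
#include <cstddef>
#include <cstdint>

// Tightly packed: adjacent array elements share a 64-byte cache line, so
// threads updating neighboring elements contend for the same line.
struct PackedSum { unsigned long sum; };

// Padded: alignas(64) rounds both size and alignment up to one cache line,
// so each element gets a line of its own.
struct alignas(64) PaddedSum { unsigned long sum; };

static_assert(sizeof(PaddedSum) == 64, "one element per cache line");
static_assert(alignof(PaddedSum) == 64, "elements start on a line boundary");

// Index of the 64-byte cache line that contains address p.
std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / 64;
}

alignas(64) PackedSum packed_slots[4];
PaddedSum padded_slots[4];
```

The per-element stride of 64 bytes corresponds to the `add rbp,40h` in the "good" disassembly of this section, versus `add rbp,4` for the packed layout on Windows, where `unsigned long` is 4 bytes.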
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Avoid False Sharing, time (s)
binary | Before | After
Time (s) | 365 | 47
Improvement | | +679%
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;   // each thread writes only its own element
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX.
[Screenshot columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX.
[Screenshot columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
[Screenshot: profiler source view, before]
SOURCE AFTER
[Screenshot: profiler source view, after]
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
        mov qword ptr [rsp+28h],rsi
        lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov r9,rbp
        mov dword ptr [rsp+20h],esi
        xor edx,edx
        xor ecx,ecx
        call qword ptr [__imp_CreateThread]
        mov qword ptr [r15+rdi*8],rax
        add rbp,40h
        inc rdi
        cmp rdi,r14
        jb 0000000140001180
BEFORE (BAD)
    0000000140001180:
        mov qword ptr [rsp+28h],rsi
        lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov r9,rbp
        mov dword ptr [rsp+20h],esi
        xor edx,edx
        xor ecx,ecx
        call qword ptr [__imp_CreateThread]
        mov qword ptr [r12+rdi*8],rax
        add rbp,4
        inc rdi
        cmp rdi,r14
        jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64
Average | 52 | 44 | 43 | 39
1 | 52 | 44 | 43 | 39
2 | 52 | 44 | 43 | 39
3 | 52 | 44 | 43 | 39
4 | 52 | 44 | 43 | 39
diff vs Release x86 (%) | | 18 | 21 | 33
label | | +18% | +21% | +33%
Input: 1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021)
Avoid False Sharing (less is better)
before | after | binary | seconds | difference
3878270 | 468667 | Before | 36.5 |
3579369 | 468613 | After | 4.7 | +679%
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
Avoid Too Many Non-Temporal Streams (less is better)
before | after | binary | seconds | difference
1007610284 | 476582356 | Before | 10.4 |
1024247928 | 448045378 | After | 4.7 | +123%
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
Use Best Practices With Spinlocks (less is better)
before | after | binary | seconds | difference
3608651906 | 267431402 | Before | 30.4 |
2242053937 | 273707055 | After | 2.7 | +1018%
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
Cache latency (core clock cycles) | L1D | L2U | L3U
AMD Ryzen 7 1800X | 4 | 17 | 40
AMD Ryzen 7 2700X | 4 | 12 | 35
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.85 |
DX11 | 58.95 | +9%
VK Beta | 60.17 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
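The "diff" labels in these tables are consistent with a simple ratio: for elapsed time (lower is better) the label is before/after minus one, and for frame rate (higher is better) it is after/before minus one. The C++ sketch below illustrates that arithmetic. The function names are hypothetical, and the FPS inputs in the usage note assume the decimal point falls after the second digit (42.68 and 58.19), which the extracted text does not show explicitly.

```cpp
#include <cmath>

// Percent improvement when lower is better (for example elapsed seconds).
// Illustrative helper, not taken from the slides.
double percent_gain_time(double before_s, double after_s) {
    return (before_s / after_s - 1.0) * 100.0;
}

// Percent improvement when higher is better (for example frames per second).
double percent_gain_rate(double before_fps, double after_fps) {
    return (after_fps / before_fps - 1.0) * 100.0;
}

// Round to the nearest whole percent, as the chart labels appear to do.
long rounded_percent(double pct) {
    return std::lround(pct);
}
```

With the Audacity numbers, `rounded_percent(percent_gain_time(170.0, 82.0))` gives 107, and with the World of Warcraft numbers, `rounded_percent(percent_gain_rate(42.68, 58.19))` gives 36.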
“ZEN” FAMILY PROCESSORS
DATA FLOW
“PINNACLE RIDGE” 8 CORE PROCESSOR
[Block diagram; labels recovered from the figure:]
• 64K I-Cache, 4-way; 32B fetch
• 32K D-Cache, 8-way; 2*16B load, 1*16B store
• 512K L2 I+D Cache, 8-way
• 8M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 4.3 GHz | 3.7 GHz; fclk = uclk = memclk = 1.47 GHz (DDR4-2933); lclk 615 MHz
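The clock summary for this diagram ties memclk to the DRAM speed grade (DDR4-2933). As a rough sketch of that arithmetic, the C++ below assumes standard DDR4 behavior that the slide does not spell out: two transfers per memory clock and a 64-bit (8-byte) channel. The function names are illustrative.

```cpp
// DDR memory transfers data on both clock edges, so the memory clock in MHz
// is half of the MT/s rating in the speed grade (e.g. 2933 for DDR4-2933).
double memclk_mhz(double megatransfers_per_s) {
    return megatransfers_per_s / 2.0;
}

// Theoretical peak bandwidth of one DRAM channel in MB/s.
// Assumption: a 64-bit channel moves 8 bytes per transfer.
double channel_peak_mb_per_s(double megatransfers_per_s) {
    const double bytes_per_transfer = 8.0;
    return megatransfers_per_s * bytes_per_transfer;
}
```

For DDR4-2933, `memclk_mhz(2933.0)` is 1466.5 MHz, which is consistent with the memclk value quoted for this configuration, and `channel_peak_mb_per_s(2933.0)` is 23464 MB/s per channel under the stated assumptions.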
“RAVEN RIDGE” 4 CORE PROCESSOR
[Block diagram; labels recovered from the figure:]
• 64K I-Cache, 4-way; 32B fetch
• 32K D-Cache, 8-way; 2*16B load, 1*16B store
• 512K L2 I+D Cache, 8-way
• 4M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, Media
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 3.9 GHz | 3.6 GHz; fclk = uclk = memclk = 1.47 GHz (DDR4-2933); lclk 496 MHz; sclk 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Block diagram; labels recovered from the figure:]
• 64K I-Cache, 4-way; 32B fetch
• 32K D-Cache, 8-way; 2*16B load, 1*16B store
• 512K L2 I+D Cache, 8-way
• 8M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller
• GMI: 16B/cycle off die
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk 4.4 GHz | 3.5 GHz; fclk = uclk = memclk = 1.47 GHz (DDR4-2933); lclk 727 MHz
• 16 cores, 32 threads
• 4 DDR channels
• ~64ns near memory
• ~105ns far memory
• NUMA or UMA
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
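To make the near/far memory numbers concrete, here is a deliberately simple C++ sketch of a weighted-average latency model. It is an assumption for illustration only, not a model from the slides: it ignores caches, bandwidth limits, and queuing, and `local_fraction` (the share of DRAM accesses served by the local die) is a hypothetical parameter.

```cpp
// Expected DRAM latency in nanoseconds under a linear mix of local and
// remote accesses. Defaults use the ~64 ns near and ~105 ns far figures
// quoted for DDR4-3200; local_fraction is a hypothetical input in [0, 1].
double expected_latency_ns(double local_fraction,
                           double near_ns = 64.0,
                           double far_ns = 105.0) {
    return local_fraction * near_ns + (1.0 - local_fraction) * far_ns;
}
```

Under this model, a workload whose memory is entirely local sees about 64 ns, while an even split between local and remote memory averages 84.5 ns, which gives some intuition for why keeping threads and their data on one die can help.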
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Diagram labels: Die 0 and Die 1; four CCX blocks; DDR4 channels; IO and DDR blocks; four groups of 16 PCIe® Gen 3 lanes; Infinity Fabric (∞) links]
• 32 cores, 64 threads
• 4 DDR channels
• ~64ns near memory
• ~105ns far memory
• NUMA only
• Windows 10 2019H1 Insider Preview has improved NUMA support
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
[Diagram labels: Die 0, Die 1, Die 2, Die 3; eight CCX blocks; DDR4 channels; IO and DDR blocks; groups of 16 PCIe® Gen 3 lanes; Infinity Fabric (∞) links between dies]
MICROARCHITECTURE
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
• Testing by AMD Performance labs. PC manufacturers may vary configurations yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs. 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs. 19.2 | +64%). Cinebench R15 1T scores: “Zen” vs. “Piledriver” (139 vs. 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs. 97.5, both at 4.0G | +64%). RZN-11
ZEN SMT DESIGN OVERVIEW
Family 17h added the AMD vendor-specific instruction CLZERO and deprecated support for FMA4, TBM & XOP
INSTRUCTION SET EVOLUTION
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE4.1 | SSE4.2 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | “Llano” | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
• Avoid mixing legacy SIMD and AVX instructions
• Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
• Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS, and RSQRTSS, which define relative error as:
• |Relative Error| <= 1.5 * 2^-12
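Because the reciprocal and reciprocal-square-root approximation instructions only promise a relative error bound of 1.5 * 2^-12, code that compares such results across machines may want a tolerance check rather than exact equality. The following C++ is a minimal sketch of that idea; the helper names are hypothetical, and it works on plain doubles so it does not depend on any particular intrinsic.

```cpp
#include <cmath>

// Relative error bound stated for RCPPS, RCPSS, RSQRTPS and RSQRTSS:
// 1.5 * 2^-12 (2^12 = 4096).
constexpr double kApproxMaxRelError = 1.5 / 4096.0;

// True if an approximate result lies within the bound of the exact value.
// Hypothetical helper for illustration.
bool within_approx_bound(double approx, double exact) {
    return std::fabs(approx - exact) <= kApproxMaxRelError * std::fabs(exact);
}

// Two approximations of the same exact value (for example from processors
// of different vendors) are both acceptable if each satisfies the bound,
// even when they are not bit-identical to each other.
bool approx_results_compatible(double a, double b, double exact) {
    return within_approx_bound(a, exact) && within_approx_bound(b, exact);
}
```

A test suite that hashes or bit-compares outputs of these instructions could instead validate each output against a full-precision reference with a check of this kind.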
FLOAT INSTRUCTIONS
[Diagram: floating-point unit; labels recovered from the figure:]
• Decode/Rename; 192 entry Retire Queue
• 4 fp micro-op dispatch; 8 micro-op retire
• 36 entry Scheduler
• 160 entry Physical Register File
• Forwarding Muxes
• MUL0, ADD0, MUL1, ADD1
• Load Convert Unit; 128 bit loads
• Int to FP; Fp to Int
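The VZEROUPPER guidance on this slide can be sketched in C++ with intrinsics. This is an illustrative example under stated assumptions, not code from the talk: `sum_floats` is a hypothetical function, the AVX path is only compiled when the compiler defines `__AVX__` (for example with `-mavx`), and many compilers already insert VZEROUPPER automatically at function boundaries, so the explicit `_mm256_zeroupper()` call is shown mainly to make the transition point visible.

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Sum an array of floats. When built with AVX enabled, the bulk of the work
// uses 256-bit YMM registers; the upper halves are cleared before returning
// so that later legacy SSE code does not run with dirty upper YMM state.
float sum_floats(const float* data, std::size_t n) {
    float total = 0.0f;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256 acc = _mm256_setzero_ps();
    for (; i + 8 <= n; i += 8) {
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(data + i));
    }
    float lanes[8];
    _mm256_storeu_ps(lanes, acc);
    for (float v : lanes) {
        total += v;
    }
    // Leaving the AVX section: zero the upper 128 bits of all YMM registers.
    _mm256_zeroupper();
#endif
    // Scalar tail (and the whole loop when AVX is not enabled at build time).
    for (; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

The same pattern applies in the other direction: if a hand-written AVX routine is about to call into a library built with legacy SSE encodings, clearing the upper halves first avoids the mixing penalty described above.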
• AMD vendor-specific instruction
• An atomic MOVNT (streaming store) type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
• Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue -> Store Commit -> Write Combining Buffer -> L2 -> Data Fabric -> Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
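As a minimal illustration of the last recommendation, the C++ sketch below zeroes memory with the standard library's `memset` instead of a CLZERO intrinsic. The wrapper names `zero_buffer` and `zero_vector` are hypothetical; the only library calls used are `std::memset` and standard container accessors.

```cpp
#include <cstddef>
#include <cstring>
#include <type_traits>
#include <vector>

// Zero a raw buffer with memset. The C runtime is free to pick a suitable
// store strategy for the buffer size, and the data stays cacheable.
void zero_buffer(void* dst, std::size_t bytes) {
    std::memset(dst, 0, bytes);
}

// Convenience wrapper for vectors of trivially copyable elements
// (hypothetical helper for illustration).
template <typename T>
void zero_vector(std::vector<T>& v) {
    static_assert(std::is_trivially_copyable<T>::value,
                  "memset is only safe for trivially copyable types");
    if (!v.empty()) {
        zero_buffer(v.data(), v.size() * sizeof(T));
    }
}
```

Since CLZERO is described as non-cacheable and intended for machine-check recovery, a plain `memset` is the more natural choice when the zeroed data will be read or written again soon.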
CLZERO INSTRUCTION
[Diagram: load/store unit. Labels: AGU0, AGU1, To Ex, To FP; L0 Pick, L1 Pick; Load Queue, Store Queue; L1/L2 TLB, Micro-tags; Page Walker; TLB0, DAT0, TLB1, DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; Store Pipe Pick, STP, Store Commit; WCB, To L2; MAB]
• Cache line size is 64 Bytes
• 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
• Lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
• And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
CACHE LATENCY
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
Cache Latency in core clock cycles (less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
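The capacities in the cache table follow from sets x ways x line size, and the "2 clock cycles per cache line" figure follows from a 64-byte line moving over a 32-byte-per-cycle link. The C++ sketch below checks that arithmetic at compile time; the struct and function names are illustrative, and the 32 B/cycle width is taken from the data-flow diagrams earlier in the deck.

```cpp
#include <cstdint>

// Geometry of one cache level, as listed in the cache table.
struct CacheLevel {
    std::uint64_t sets;
    std::uint64_t ways;
    std::uint64_t line_bytes;
};

// Capacity in bytes = sets * ways * line size.
constexpr std::uint64_t capacity_bytes(CacheLevel c) {
    return c.sets * c.ways * c.line_bytes;
}

// Cycles needed to move one line over a link of the given width (rounded up).
constexpr std::uint64_t cycles_per_line(std::uint64_t line_bytes,
                                        std::uint64_t bytes_per_cycle) {
    return (line_bytes + bytes_per_cycle - 1) / bytes_per_cycle;
}

constexpr CacheLevel kL1I{256, 4, 64};
constexpr CacheLevel kL1D{64, 8, 64};
constexpr CacheLevel kL2{1024, 8, 64};
constexpr CacheLevel kL3{8192, 16, 64};

static_assert(capacity_bytes(kL1I) == 64 * 1024, "L1I should be 64 KB");
static_assert(capacity_bytes(kL1D) == 32 * 1024, "L1D should be 32 KB");
static_assert(capacity_bytes(kL2) == 512 * 1024, "L2 should be 512 KB");
static_assert(capacity_bytes(kL3) == 8 * 1024 * 1024, "L3 should be 8 MB");
static_assert(cycles_per_line(64, 32) == 2, "64 B line over 32 B/cycle");
```

The same formula explains the op cache row: 32 sets x 8 ways x 8 uops per line gives 2 K uops.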
• ~64ns near memory (see footnote RP2-22)
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM LOCAL DRAM
[Diagram labels: CCX x2, CCM x2, SDF Transport Layer, CS x2, UMC x2, DDR4 x2, CAKE, IFIS or IFOP (off-chip)]
• Refill from other CCX cost may be similar to memory latency
REFILL FROM OTHER LOCAL CCX
[Diagram labels: CCX x2, CCM x2, SDF Transport Layer, CS x2, UMC x2, DDR4 x2, CAKE, IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
[Diagram: two dies, each with labels CCX x2, CCM x2, SDF Transport Layer, CS x2, UMC x2, DDR4 x2, CAKE, IFIS or IFOP (off-chip)]
• ~105ns far memory (see footnote RP2-22)
AMD UPROF PROFILER
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Embedded chart data; the plotted values did not survive extraction. Chart titles and axes:]
• gxMT Off vs. gxMT On: Average FPS
• Audacity LAME MP3 Encode x86 (lower is better): Elapsed Time (s), v3.99 non-SSE2 vs. v3.100 SSE2
• DOTA 2 version 3359 gameplay 7.21B (higher is better): Average FPS, DX9 / DX11 / VK Beta
• Cache Latency (less is better): core clock cycles, L1D / L2U / L3U, AMD Ryzen 7 1800X vs. AMD Ryzen 7 2700X
• Use Best Practices With Spinlocks (less is better): Elapsed Time (s), Before vs. After binary
• Avoid Too Many Non-Temporal Streams (less is better): Elapsed Time (s), Before vs. After binary
• Avoid False Sharing (less is better): Elapsed Time (s), Before vs. After binary
• LAME 3.100 WAV to MP3 Encode (less is better): Elapsed Time (s) by Configuration Platform
• call memcpy benchmark (less is better): Elapsed Time (s) vs. size, before and after
• call memcpy benchmark: B difference from A vs. size
• D3D12Multithreading, TitleThrottle=10000 (less is better): cpu time difference from NumContexts 1 vs. NumContexts, for Draws1025 and Draws10250
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201927
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
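The metrics on this slide are mostly ratios of raw event counts. As a rough sketch of how two of them can be derived, assuming PTI means events per thousand retired instructions and IPC means retired instructions per CPU clock: the function names below are hypothetical and are not part of AMD uProf, which computes these values itself.

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions (PTI).
// Hypothetical helper for illustration; not an AMD uProf API.
double perThousandInstructions(std::uint64_t eventCount,
                               std::uint64_t retiredInstructions) {
    assert(retiredInstructions > 0);
    return 1000.0 * static_cast<double>(eventCount) /
           static_cast<double>(retiredInstructions);
}

// Instructions per cycle (IPC): retired instructions divided by CPU clocks.
double instructionsPerCycle(std::uint64_t retiredInstructions,
                            std::uint64_t cpuClocks) {
    assert(cpuClocks > 0);
    return static_cast<double>(retiredInstructions) /
           static_cast<double>(cpuClocks);
}
```

Normalizing by retired instructions makes functions of different sizes comparable, which is why the thresholds later in the deck are quoted per thousand instructions.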
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access paths labeled in the diagram: rdpmc, SMN, in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
  • Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: the SMU may still reduce frequency if the application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
• Binary compiled after applying platform=x64 shows higher performance
• Configuration=ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 444 minutes
COMPILE FOR PLATFORMX64
LAME 3.100 WAV to MP3 Encode: Elapsed Time (s) by Configuration | Platform (less is better)
Configuration | Platform | seconds | difference
Release | x86 | 52 |
ReleaseSSE2 | x86 | 44 | +18%
Release | x64 | 43 | +21%
ReleaseSSE2 | x64 | 39 | +33%
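The percentage labels on the chart are consistent with dividing the baseline time by the new time. A minimal sketch of that arithmetic, assuming this is how the labels were derived (the deck does not state the formula, and `percentFaster` is a hypothetical name):

```cpp
#include <cassert>
#include <cmath>

// Percentage speedup of a new elapsed time relative to a baseline:
// (baseline / new - 1) * 100, rounded to the nearest integer.
long percentFaster(double baselineSeconds, double newSeconds) {
    assert(baselineSeconds > 0.0 && newSeconds > 0.0);
    return std::lround((baselineSeconds / newSeconds - 1.0) * 100.0);
}
```

With the Release x86 time of 52 s as the baseline, the 44 s, 43 s, and 39 s results give 18, 21, and 33. Other charts in the deck may not reproduce exactly from their rounded times.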
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

    0:000> g
    (79c838): Illegal instruction - code c000001d (first chance)
    (79c838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
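As a rough sketch of the check this slide recommends, the block below reads CPUID feature bits before an optional code path is chosen. The helper names (`queryCpuid`, `cpuHasSse2`, `cpuHasSsse3`, `cpuHasFma4`) are hypothetical; the bit positions are the architecturally defined ones (leaf 1 EDX bit 26 for SSE2, leaf 1 ECX bit 9 for SSSE3, leaf 0x80000001 ECX bit 16 for FMA4). On non-x86 builds the sketch simply reports that no feature is present.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
// Fills regs with EAX, EBX, ECX, EDX; returns false if the leaf is unsupported.
inline bool queryCpuid(std::uint32_t leaf, std::uint32_t regs[4]) {
    assert(regs != nullptr);
    int r[4] = {0, 0, 0, 0};
    __cpuid(r, static_cast<int>(leaf & 0x80000000u));  // highest leaf in this range
    if (static_cast<std::uint32_t>(r[0]) < leaf) return false;
    __cpuid(r, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<std::uint32_t>(r[i]);
    return true;
}
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
inline bool queryCpuid(std::uint32_t leaf, std::uint32_t regs[4]) {
    assert(regs != nullptr);
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(leaf, &a, &b, &c, &d) == 0) return false;
    regs[0] = a; regs[1] = b; regs[2] = c; regs[3] = d;
    return true;
}
#else
// Not an x86 target: report every feature as absent.
inline bool queryCpuid(std::uint32_t, std::uint32_t regs[4]) {
    assert(regs != nullptr);
    regs[0] = regs[1] = regs[2] = regs[3] = 0;
    return false;
}
#endif

inline bool bitSet(std::uint32_t value, unsigned bit) {
    return ((value >> bit) & 1u) != 0;
}

bool cpuHasSse2() {
    std::uint32_t r[4];
    return queryCpuid(1u, r) && bitSet(r[3], 26);           // EDX bit 26
}

bool cpuHasSsse3() {
    std::uint32_t r[4];
    return queryCpuid(1u, r) && bitSet(r[2], 9);            // ECX bit 9
}

bool cpuHasFma4() {
    std::uint32_t r[4];
    return queryCpuid(0x80000001u, r) && bitSet(r[2], 16);  // ECX bit 16
}
```

A caller would select the SSSE3 path only when `cpuHasSsse3()` returns true and fall back to SSE2 code otherwise. AVX-class features additionally need an operating-system support check via XGETBV, which this sketch leaves out.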
USE BEST PRACTICES WHEN COUNTING CORES
    // This advice is specific to AMD processors and is not general
    // guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
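The slide code calls `getCpuidFamily()` without showing it. A minimal sketch of one way to derive the family from CPUID leaf 1 EAX, together with a version of the thread-count rule that takes the topology as parameters so it can be exercised without hardware. Both function names are hypothetical; the family rule (base family in bits 11:8, plus the extended family in bits 27:20 when the base family is 0xF) is the standard CPUID encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Display family from CPUID leaf 1 EAX.
unsigned cpuidFamilyFromEax(std::uint32_t eax) {
    const unsigned baseFamily = (eax >> 8) & 0xFu;
    const unsigned extendedFamily = (eax >> 20) & 0xFFu;
    return (baseFamily == 0xFu) ? baseFamily + extendedFamily : baseFamily;
}

// Same rule as the slide: physical cores on AMD, except family 15h
// ("Bulldozer"), and logical processors everywhere else.
unsigned defaultThreadCount(const char* vendor, unsigned family,
                            unsigned physicalCores, unsigned logicalProcessors) {
    assert(vendor != nullptr);
    assert(physicalCores >= 1 && logicalProcessors >= physicalCores);
    const bool isAmd = std::strcmp(vendor, "AuthenticAMD") == 0;
    const bool isBulldozerFamily = (family == 0x15u);
    return (isAmd && !isBulldozerFamily) ? physicalCores : logicalProcessors;
}
```

For example, an EAX value of 0x00800F82 decodes to family 17h, so an 8-core, 16-thread part would get 8 worker threads under this rule. Treat the result as a starting point and profile, as the slide advises.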
Incorrect enum.exe output for Ryzen™ 7 2700X:

    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:

    Number of cores per processor: 8
    Number of threads per processor core: 2

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index
    #if 0 // BAD
    int x = p;                          // signed narrowing
    int count = 0;
    // while (x != 0)                   // never zero
    while (x > 0) {                     // false
        count += (x & 1);
        x >>= 1;                        // fill bits with sign
    }
    printf("popcnt = %i\n", count);     // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif

AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: for signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
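The GOOD branch above can be restated portably with fixed-width unsigned types. This is a sketch with hypothetical function names; it assumes the affinity mask fits in 64 bits, which matches the single processor group described on the slide.

```cpp
#include <cassert>
#include <cstdint>

// Counts set bits in a 64-bit affinity mask. The mask stays unsigned and
// 64 bits wide, so the right shift fills with zeros and the loop terminates.
unsigned countProcessorsInMask(std::uint64_t mask) {
    unsigned count = 0;
    while (mask != 0) {
        count += static_cast<unsigned>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Builds a single-processor mask. The literal 1 is widened to 64 bits
// before shifting, so bit indexes 32..63 are well defined.
std::uint64_t maskForProcessor(unsigned bitIndex) {
    assert(bitIndex < 64);
    return std::uint64_t{1} << bitIndex;
}
```

With a full mask the count is 64, and bit index 32 yields 0x0000000100000000, the value the GOOD branch prints.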
    // char buffer[0x1000];  // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ... use the returned topology records ...
            }
            free(buffer);
        }
    }

GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: cpu time difference from NumContexts 1 (-50 to 400). Series: Draws 1025, Draws 10250.
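The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small helper. The function name is hypothetical, and two details are assumptions not stated in the deck: the division is integer division, and the result is clamped to at least one context.

```cpp
#include <algorithm>
#include <cassert>

// Recommended number of command-list recording contexts:
// min(cores - 1, draws / 250), clamped to at least 1 (assumption).
unsigned recommendedNumContexts(unsigned physicalCores, unsigned drawCount) {
    assert(physicalCores >= 1);
    const unsigned byCores = physicalCores - 1;   // leave one core free
    const unsigned byDraws = drawCount / 250;     // enough work per context
    return std::max(1u, std::min(byCores, byDraws));
}
```

On the 8-core test system this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.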
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                    pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
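For code that does not use the MSVC intrinsics, the same practices (test and test-and-set, pause while spinning, one cache line per lock) can be sketched with `std::atomic`. This is an illustrative variant, not the code from the talk: the `SpinLock` type and `cpuRelax` helper are hypothetical names, and unlocking with a release store rather than an exchange is a design choice of this sketch.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
static inline void cpuRelax() { _mm_pause(); }               // x86 pause instruction
#else
static inline void cpuRelax() { std::this_thread::yield(); } // fallback elsewhere
#endif

// Aligned to 64 bytes so the lock does not share a cache line with other data.
struct alignas(64) SpinLock {
    std::atomic<bool> taken{false};

    // Test first with a plain load, and only then attempt the atomic exchange.
    bool tryLock() {
        return !taken.load(std::memory_order_relaxed) &&
               !taken.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!tryLock()) {
            cpuRelax();  // spin politely while the lock is held
        }
    }

    void unlock() {
        assert(taken.load(std::memory_order_relaxed));  // must be held
        taken.store(false, std::memory_order_release);
    }
};

static_assert(alignof(SpinLock) == 64, "lock should own its cache line");
```

The plain load keeps waiting threads reading a shared copy of the cache line, so the exchange is attempted only when the lock looks free.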
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Use Best Practices With Spinlocks: Elapsed Time (s) by binary (less is better)
binary | Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>     // std::accumulate
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)

    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:

BEFORE (BAD)

    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
PREFACE
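To judge whether a workload is exposed, it can help to check how many of its runtime-sized copies fall in the reported window. A minimal sketch of the stated condition (length > 32 and length <= 128 bytes); the function names are hypothetical, and the list of copy lengths is assumed to come from your own instrumentation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// True when a runtime copy length falls in the window the slide reports:
// more than 32 bytes and at most 128 bytes.
bool inReportedRegressionRange(std::size_t lengthBytes) {
    return lengthBytes > 32 && lengthBytes <= 128;
}

// Fraction of observed copy lengths that fall inside the window.
double fractionInReportedRange(const std::vector<std::size_t>& copyLengths) {
    assert(!copyLengths.empty());
    std::size_t hits = 0;
    for (std::size_t length : copyLengths) {
        if (inReportedRegressionRange(length)) ++hits;
    }
    return static_cast<double>(hits) / static_cast<double>(copyLengths.size());
}
```

Copies whose length is a compile-time constant are outside the stated condition, so only runtime-sized calls need to be counted.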
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

WORKAROUND

memcpy.asm line 456:

    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG  ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]        ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:

        ; Check if strings should be used
        cmp     r8, 128                  ; is this a small set, size <= 128?
        ja      XmmSet                   ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG  ; check if string set should be used
        jnc     XmmSetSmall              ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s) (0 to 35). Series: before, after.
CODE SAMPLE

    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);   // size is only known at runtime
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Loop bound is < sizeof(a); the slide's <= wrote one byte past the array.
        for (int i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)

    ; calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …

BEFORE (WITHOUT WORKAROUND)

    ; calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Avoid Too Many Non-Temporal Streams: Elapsed Time (s) by binary (less is better)
binary | Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE

    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, atof, rand, srand
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)

    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups   xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)

    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Avoid False Sharing: Time (s) by binary
binary | Time (s)
Before | 365
After | 47 (+679%)
CODE SAMPLE

    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, rand, srand
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;   // each thread writes only its own element
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms) %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
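The difference between the two `ThreadData` variants comes down to whether neighbouring array elements land in the same 64-byte cache line. A small sketch that makes this checkable: the type and function names are hypothetical, and the 64-byte line size is taken from the `alignas(64)` used in the sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;

// Packed: adjacent array elements share a cache line.
struct PackedCounter { unsigned long sum; };

// Padded: alignas rounds the size up to 64 bytes, one line per element.
struct alignas(kCacheLineBytes) PaddedCounter { unsigned long sum; };

static_assert(sizeof(PaddedCounter) == kCacheLineBytes, "one line per element");
static_assert(alignof(PaddedCounter) == kCacheLineBytes, "starts on a line boundary");

// True when both addresses fall inside the same 64-byte cache line.
bool sameCacheLine(const void* first, const void* second) {
    assert(first != nullptr && second != nullptr);
    const std::uintptr_t lineA = reinterpret_cast<std::uintptr_t>(first) / kCacheLineBytes;
    const std::uintptr_t lineB = reinterpret_cast<std::uintptr_t>(second) / kCacheLineBytes;
    return lineA == lineB;
}
```

With padded elements, per-thread counters in an array never share a line, which is the layout the "64 bytes" branch of the sample produces. The cost is 64 bytes of memory per counter.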
PROFILING BEFORE
Many refills from CCX
Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX
Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180

BEFORE (BAD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
run | 2017 The International DX9 | DX11 | Vulkan | 2015 The Frankfurt Major DX9 | DX11 | Vulkan
1 | 48.6 | 54.0 | 53.2 | 52.1 | 58.9 | 59.0
2 | 51.8 | 54.3 | 54.7 | 53.8 | 58.8 | 60.2
3 | 51.7 | 54.5 | 54.9 | 53.9 | 58.8 | 60.0
4 | 51.8 | 54.4 | 55.1 | 53.9 | 58.9 | 60.2
average last 3 | 51.8 | 54.4 | 54.9 | 53.9 | 58.9 | 60.1
diff vs DX9 (%) | | 5 | 6 | | 9 | 12
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 diff (%) | Draws10250 diff (%)
1 | 0.2016 | 1.5024 | 1 | 0 | 0
2 | 0.148 | 0.7923 | 2 | 36 | 90
3 | 0.1336 | 0.5653 | 3 | 51 | 166
4 | 0.1278 | 0.4625 | 4 | 58 | 225
5 | 0.1387 | 0.4025 | 5 | 45 | 273
6 | 0.1499 | 0.3674 | 6 | 34 | 309
7 | 0.17 | 0.3429 | 7 | 19 | 338
8 | 0.1867 | 0.3432 | 8 | 8 | 338
9 | 0.1859 | 0.3947 | 9 | 8 | 281
10 | 0.1947 | 0.3854 | 10 | 4 | 290
11 | 0.2083 | 0.3754 | 11 | -3 | 300
12 | 0.2173 | 0.3788 | 12 | -7 | 297
13 | 0.2241 | 0.3787 | 13 | -10 | 297
14 | 0.2533 | 0.3783 | 14 | -20 | 297
15 | 0.2526 | 0.3800 | 15 | -20 | 295
16 | 0.2652 | 0.3721 | 16 | -24 | 304
before | after | size | before (rounded) | after (rounded) | diff (%) | |||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff (%) | | 18 | 21 | 33
label | | +18% | +21% | +33%
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359, gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
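The "diff" figures in these tables appear to be percentage gains relative to the baseline row, computed in opposite directions for "higher is better" and "lower is better" metrics. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the formula is inferred from the table values rather than stated in the slides.

```cpp
#include <cassert>
#include <cmath>

// Percent gain when a larger value is better (e.g. average frames per second).
inline double gain_higher_is_better(double before, double after) {
    assert(before > 0.0);
    return (after / before - 1.0) * 100.0;
}

// Percent gain when a smaller value is better (e.g. elapsed seconds).
inline double gain_lower_is_better(double before, double after) {
    assert(after > 0.0);
    return (before / after - 1.0) * 100.0;
}
```

For the Audacity row, 170 s before and 82 s after gives (170 / 82 - 1) * 100, which rounds to the +107 shown. Reading the World of Warcraft figures as 42.68 and 58.19 fps gives the +36 shown.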
DATA FLOW
"PINNACLE RIDGE" 8 CORE PROCESSOR
[Block diagram. Labels: 32K D-Cache, 8-way (2*16B load, 1*16B store); 64K I-Cache, 4-way (32B fetch); 512K L2 I+D Cache, 8-way; 8M L3 I+D Cache, 16-way; Data Fabric; Unified Memory Controller; DRAM Channel; IO Hub Controller. Link widths shown: 32B/cycle and 16B/cycle. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
cclk: 4.3 GHz / 3.7 GHz
fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
lclk: 615 MHz
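The link widths (bytes per cycle) and clock frequencies on this slide can be combined into a rough peak-bandwidth estimate. The sketch below shows only the arithmetic. Which clock domain drives which link is not recoverable from the extracted text, so pairing a 32B/cycle link with the 1.47 GHz fclk is an assumption made for illustration, and `peak_gb_per_s` is a hypothetical helper.

```cpp
#include <cassert>
#include <cmath>

// Peak bandwidth in GB/s (1 GB = 1e9 bytes) for a link that moves
// bytes_per_cycle every cycle at clock_ghz.
constexpr double peak_gb_per_s(double bytes_per_cycle, double clock_ghz) {
    return bytes_per_cycle * clock_ghz;
}
```

Under that assumption a 32B/cycle link at 1.47 GHz peaks at roughly 47 GB/s, and a 32B/cycle link at a 4.3 GHz cclk at roughly 138 GB/s. Sustained rates will be lower.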
"RAVEN RIDGE" 4 CORE PROCESSOR
[Block diagram. Labels: 32K D-Cache, 8-way (2*16B load, 1*16B store); 64K I-Cache, 4-way (32B fetch); 512K L2 I+D Cache, 8-way; 4M L3 I+D Cache, 16-way; Data Fabric; Unified Memory Controller; DRAM Channel; IO Hub Controller; GFX9; Media. Link widths shown: 32B/cycle and 16B/cycle. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
cclk: 3.9 GHz / 3.6 GHz
fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
lclk: 496 MHz
sclk: 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Block diagram. Labels: 32K D-Cache, 8-way (2*16B load, 1*16B store); 64K I-Cache, 4-way (32B fetch); 512K L2 I+D Cache, 8-way; 8M L3 I+D Cache, 16-way; Data Fabric; Unified Memory Controller; DRAM Channel; IO Hub Controller; GMI. Link widths shown: 32B/cycle and 16B/cycle on die, 16B/cycle off die. Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk.]
cclk: 4.4 GHz / 3.5 GHz
fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
lclk: 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR channels
• ~64ns near memory
• ~105ns far memory
• NUMA or UMA
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Diagram. Labels: Die 0 and Die 1, two CCX per die, DDR4 Channels A to D, DDR and IO blocks, four groups of 16 PCIe® Gen 3 lanes, Infinity Fabric (∞) links.]
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR channels
• ~64ns near memory
• ~105ns far memory
• NUMA only
• Windows 10 2019H1 Insider Preview has improved NUMA support
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Diagram. Labels: Die 0 to Die 3, two CCX per die, DDR4 Channels A to D, DDR and IO blocks, four groups of 16 PCIe® Gen 3 lanes, Infinity Fabric (∞) links between the dies.]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC than previous generation AMD processors
• Testing by AMD Performance labs. PC manufacturers may vary configurations yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen") / 8GB DDR3-2133 ("Excavator") / 8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the "Zen" architecture vs. "Piledriver" architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the "Zen" architecture vs. "Excavator" architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen") / 8GB DDR3-2133 ("Excavator") / 8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: "Zen" vs. "Piledriver" (31.5 vs. 20.7 | +52%), "Zen" vs. "Excavator" (31.5 vs. 19.2 | +64%). Cinebench R15 1t scores: "Zen" vs. "Piledriver" (139 vs. 79, both at 3.4G | +76%), "Zen" vs. "Excavator" (160 vs. 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.
Feature columns, in order: ADX, CLFLUSHOPT, RDSEED, SHA, SMAP, XGETBV, XSAVEC, XSAVES, AVX2, BMI2, MOVBE, RDRND, SMEP, FSGSBASE, XSAVEOPT, BMI, FMA, F16C, AES, AVX, OSXSAVE, PCLMULQDQ, SSE4.1, SSE4.2, XSAVE, SSSE3, CLZERO, FMA4, TBM, XOP
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | FEATURES (1 = supported)
2017 | 17h | 00h-0Fh | "Summit Ridge", "Pinnacle Ridge" | Ryzen™ 7 2700X | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
2015 | 15h | 60h-6Fh | "Carrizo", "Bristol Ridge" | A12-9800 | 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2014 | 15h | 30h-3Fh | "Kaveri", "Godavari" | A10-7890K | 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2012 | 15h | 00h-0Fh | "Vishera" | FX-8370 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2011 | 15h | 00h-0Fh | "Zambezi" | FX-8150 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1
2013 | 16h | 00h-0Fh | "Kabini" | A6-1450 | 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0
2011 | 14h | 00h-0Fh | "Ontario" | E-450 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2011 | 12h | | "Llano" | A8-3870 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2009 | 10h | | "Greyhound" | Phenom II X4 955 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
• Avoid mixing legacy SIMD and AVX instructions
• Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
• Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for instructions RCPPS, RCPSS, RSQRTPS, RSQRTSS, which define relative error as
• |Relative Error| <= 1.5 * 2^-12
[Block diagram. Labels: Decode/Rename; 192 Entry Retire Queue; 4 fp micro-op dispatch; 8 micro-op retire; 36 Entry Scheduler; 160 entry Physical Register File; Forwarding Muxes; MUL0, ADD0, MUL1, ADD1; Load Convert Unit; 128 bit loads; Int to FP; Fp to Int.]
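Because RCPPS, RCPSS, RSQRTPS and RSQRTSS are only specified to within a relative error of 1.5 * 2^-12, code that compares their results across processors is better served by a tolerance check than by bit-exact equality. The sketch below illustrates such a check for the reciprocal case. The helper names are hypothetical, and it assumes x is non-zero and finite.

```cpp
#include <cassert>
#include <cmath>

// Documented bound for the approximate reciprocal instructions: 1.5 * 2^-12.
constexpr double kMaxRelativeError = 1.5 / 4096.0;

// True if `approx` is an acceptable approximation of 1/x under that bound.
inline bool is_valid_reciprocal(double x, double approx) {
    const double exact = 1.0 / x;
    return std::fabs((approx - exact) / exact) <= kMaxRelativeError;
}

// True if two different approximations of 1/x both satisfy the bound, so a
// cross-vendor test should treat them as equally correct.
inline bool equivalent_reciprocals(double x, double a, double b) {
    return is_valid_reciprocal(x, a) && is_valid_reciprocal(x, b);
}
```

A regression test that stores golden outputs from one vendor's processor could use `equivalent_reciprocals` instead of `==` when replaying on another.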
CLZERO INSTRUCTION
• AMD vendor specific instruction
• An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
• Example: kill a corrupt user process but keep the system running
• Execution Path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Block diagram. Labels: AGU0, AGU1; To Ex; To FP; Load Queue; Store Queue; L0 Pick, L1 Pick; Store Pipe Pick; STP; Store Commit; TLB0, DAT0; TLB1, DAT1; L1/L2 TLB; Micro-tags; Page Walker; 32K Data Cache; Prefetch; MAB; WCB; To L2; 32 bytes to/from L2.]
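The last bullet can be followed literally: general-purpose zeroing goes through `memset`, leaving the choice of store instructions to the C runtime. A minimal sketch, with `zero_buffer` and `zeroed_copy` as hypothetical helper names:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <utility>
#include <vector>

// Zero a buffer with memset instead of issuing CLZERO by hand.
inline void zero_buffer(void* data, std::size_t size_in_bytes) {
    if (size_in_bytes != 0) {
        std::memset(data, 0, size_in_bytes);
    }
}

// Convenience wrapper: returns the given bytes with every element zeroed.
inline std::vector<unsigned char> zeroed_copy(std::vector<unsigned char> bytes) {
    zero_buffer(bytes.data(), bytes.size());
    return bytes;
}
```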
CACHE LATENCY
• Cache line size is 64 Bytes
• 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
• lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
• And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
[Chart: Cache Latency (less is better), in core clock cycles]
 | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
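The Capacity column of the cache table follows from the other columns: capacity = sets * ways * line size. The short check below confirms this for each row, using only the figures from the table. `cache_bytes`, `KiB` and `MiB` are names chosen for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Capacity of a set-associative structure: sets * ways * entry size.
constexpr std::uint64_t cache_bytes(std::uint64_t sets, std::uint64_t ways,
                                    std::uint64_t line_bytes) {
    return sets * ways * line_bytes;
}

constexpr std::uint64_t KiB = 1024;
constexpr std::uint64_t MiB = 1024 * KiB;

static_assert(cache_bytes(256, 4, 64) == 64 * KiB, "L1I");
static_assert(cache_bytes(64, 8, 64) == 32 * KiB, "L1D");
static_assert(cache_bytes(1024, 8, 64) == 512 * KiB, "L2U");
static_assert(cache_bytes(8192, 16, 64) == 8 * MiB, "L3U");
// The op cache row works the same way, counted in uops rather than bytes.
static_assert(cache_bytes(32, 8, 8) == 2048, "uop cache: 2 K uops");
```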
REFILL FROM LOCAL DRAM
• ~64ns near memory (see footnote RP2-22)
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Diagram. Labels: two CCX, each with a CCM; SDF Transport Layer; two CS, each with a UMC and DDR4; CAKE to IFIS or IFOP (off-chip).]
REFILL FROM OTHER LOCAL CCX
• Refill from other CCX cost may be similar to memory latency
[Diagram. Labels: two CCX, each with a CCM; SDF Transport Layer; two CS, each with a UMC and DDR4; CAKE to IFIS or IFOP (off-chip).]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (see footnote RP2-22)
[Diagram of two dies. Labels per die: two CCX, each with a CCM; SDF Transport Layer; two CS, each with a UMC and DDR4; CAKE to IFIS or IFOP (off-chip).]
AMD μPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Multiple counters using the same event but different unit masks supported
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
[Chart: Time (s) by number of concurrently running threads, 0 to 16]
[Embedded chart data. Chart titles and axes:]
• World of Warcraft: Average FPS, gxMT Off vs. gxMT On (+36%)
• Audacity LAME MP3 Encode x86 (lower is better): Elapsed Time (s), v3.99 non-SSE2 vs. v3.100 SSE2 (+107%)
• DOTA 2 version 3359 gameplay 7.21B (higher is better): Average FPS, DX9 vs. DX11 (+9%) vs. VK Beta (+12%)
• Cache Latency (less is better): core clock cycles, L1D / L2U / L3U, AMD Ryzen 7 1800X vs. AMD Ryzen 7 2700X
• Use Best Practices With Spinlocks (less is better): Elapsed Time (s) by binary, Before vs. After (+1018%)
• Avoid Too Many Non-Temporal Streams (less is better): Elapsed Time (s) by binary, Before vs. After (+123%)
• Avoid False Sharing (less is better): Elapsed Time (s) by binary, Before vs. After (+679%)
• LAME 3.100 WAV to MP3 Encode (less is better): Elapsed Time (s) by Configuration/Platform, Release x86 52, ReleaseSSE2 x86 44 (+18%), Release x64 43 (+21%), ReleaseSSE2 x64 39 (+33%)
• call memcpy benchmark (less is better): Elapsed Time (s) by size, before vs. after, sizes 1 to 160
026113948368469037123310014465816445356027464545948084244744703169258286039647620196002881023470681131038921574766493464012699482352719539061254883162357075675349339497049681729493817569961279952311078789537335512378553045777355179055648608259552403667273093468528977000546868143282341100588507939456773521350605410391360521146184154327316390315242096638678918747415410892838252374210527033071794142580914860372179431496620330337164369560821042029843689259553153432462969300972589264527497704605483645782253341689199463070276297856954683840750231089729572219121213768-013024170415581204-013038442823616014-013072528810275874-013028983466301447-013139717579912435-013074723083930939-013124118448967514-013023791018397635-013027537654282573-013026189115487552-013075811776531843-013038659808634656-013110980398503136-013035330377184295-013071944068452823011470746038168711013001918115359845013002912167835179012958857557869230129957881450424750129166369988617640130985290708048650129763451673847510129850129652770101302937815883937801295189516001431501289468995981952701328518920731875901293262046376468801303032393524739013037289959908827011835437438594409
size
B difference from A
D3D12Multithreading
TitleThrottle=10000
(less is better)
Draws102512345678910111213141516003621621621621622405089820359281438405774647887323944904534967555875992803448965977318212401858823529411763979807177289769715E-284454007530930575E-235439137134052334E-2-32165146423427826E-2-7225034514496087E-2-010040160642570273-020410580339518369-020190023752969122-02398190045248868Draws102501234567891011121314151600896251419916698211657703874049177222484324324324323273267080745341583089275993467610233814523184601928337762237762237752806435267291613728982874935132328300213106020245052966209081309397529672564034856088297145122918318762953684210526315530376242945444769
NumContexts
cpu time difference from NumContexts1
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
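The IPC and PTI columns above are ratios of raw event counts. A minimal sketch, assuming the usual definitions (IPC = retired instructions per CPU clock, PTI = events per thousand retired instructions); the struct and function names are illustrative and are not part of AMD uProf:

```cpp
#include <cstdint>

// Raw counts as a sampling profiler might report them (hypothetical struct).
struct Counts {
    std::uint64_t cpu_clocks;
    std::uint64_t retired_instructions;
};

// Instructions per clock; returns 0 when no clocks were counted.
inline double ipc(const Counts& c) {
    return c.cpu_clocks
        ? static_cast<double>(c.retired_instructions) / c.cpu_clocks
        : 0.0;
}

// Events per thousand retired instructions (PTI);
// returns 0 when no instructions were counted.
inline double pti(std::uint64_t events, const Counts& c) {
    return c.retired_instructions
        ? 1000.0 * events / c.retired_instructions
        : 0.0;
}
```

Normalizing by retired instructions makes functions of different sizes comparable, which is why the thresholds later in this deck are quoted per thousand instructions.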
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L2
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
Access labels in the diagram: rdpmc; SMN in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
COMPILE FOR PLATFORM=X64
• Binary compiled after applying platform=x64 shows higher performance
• Configuration=ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes

[Chart: LAME 3.100 WAV to MP3 Encode; Elapsed Time (s) by Configuration|Platform (less is better)]
Configuration|Platform | seconds | vs. Release|x86
Release|x86 | 52 |
ReleaseSSE2|x86 | 44 | +18%
Release|x64 | 43 | +21%
ReleaseSSE2|x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

Example debugger output:
    0:000> g
    (79c.838): Illegal instruction - code c000001d (first chance)
    (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foo::bar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
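The checks listed above can be sketched as follows. This is a minimal illustration for x86/x64 builds, assuming the documented CPUID feature bits (SSSE3: leaf 1 ECX bit 9; AVX512F: leaf 7 subleaf 0 EBX bit 16; FMA4: leaf 0x80000001 ECX bit 16). The names `CpuidRegs`, `cpuid`, `bit`, `Features`, and `queryFeatures` are hypothetical helpers and do not come from any of the libraries named on the slide.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>   // GCC/Clang CPUID macros
#endif

struct CpuidRegs { std::uint32_t eax, ebx, ecx, edx; };

// Thin wrapper over the compiler's CPUID intrinsic (x86/x64 only).
inline CpuidRegs cpuid(std::uint32_t leaf, std::uint32_t subleaf = 0) {
    CpuidRegs r{0, 0, 0, 0};
#if defined(_MSC_VER)
    int regs[4];
    __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
    r.eax = static_cast<std::uint32_t>(regs[0]);
    r.ebx = static_cast<std::uint32_t>(regs[1]);
    r.ecx = static_cast<std::uint32_t>(regs[2]);
    r.edx = static_cast<std::uint32_t>(regs[3]);
#else
    __cpuid_count(leaf, subleaf, r.eax, r.ebx, r.ecx, r.edx);
#endif
    return r;
}

// True when bit n of reg is set.
inline bool bit(std::uint32_t reg, unsigned n) { return ((reg >> n) & 1u) != 0; }

struct Features { bool ssse3, avx512f, fma4; };

// Queries only leaves the processor reports as supported.
// Note: AVX/AVX-512 also need OS state support (checked via XGETBV),
// which this sketch deliberately leaves out.
inline Features queryFeatures() {
    Features f{false, false, false};
    const std::uint32_t maxStd = cpuid(0).eax;
    const std::uint32_t maxExt = cpuid(0x80000000u).eax;
    if (maxStd >= 1) f.ssse3 = bit(cpuid(1).ecx, 9);
    if (maxStd >= 7) f.avx512f = bit(cpuid(7, 0).ebx, 16);
    if (maxExt >= 0x80000001u) f.fma4 = bit(cpuid(0x80000001u).ecx, 16);
    return f;
}
```

A dispatcher would call `queryFeatures()` once at startup and select, for example, an SSE2 code path when `ssse3` is false, rather than executing `pabsd` unconditionally.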
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            }
            else {
                count = cores;
            }
        }
        return count;
    }
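The function above relies on `getCpuidVendor` and `getCpuidFamily`, which are not shown on the slide. One possible implementation is sketched below for x86/x64 builds; the bodies are an assumption written for illustration and are not taken from the GPUOpen sample. `readCpuid` and `decodeFamily` are hypothetical helper names.

```cpp
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>   // GCC/Clang CPUID macros
#endif

// Reads CPUID leaf `leaf` (subleaf 0) into out[0..3] = EAX, EBX, ECX, EDX.
inline void readCpuid(std::uint32_t leaf, std::uint32_t out[4]) {
#if defined(_MSC_VER)
    int regs[4];
    __cpuid(regs, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) out[i] = static_cast<std::uint32_t>(regs[i]);
#else
    __cpuid_count(leaf, 0, out[0], out[1], out[2], out[3]);
#endif
}

// Fills a 13-byte buffer with the 12-character vendor string
// (leaf 0: EBX, EDX, ECX in that order) plus a terminator.
inline void getCpuidVendor(char vendor[13]) {
    std::uint32_t r[4];
    readCpuid(0, r);
    std::memcpy(vendor + 0, &r[1], 4);  // EBX
    std::memcpy(vendor + 4, &r[3], 4);  // EDX
    std::memcpy(vendor + 8, &r[2], 4);  // ECX
    vendor[12] = '\0';
}

// Decodes the family from leaf 1 EAX: base family is bits 11:8, and the
// extended family (bits 27:20) is added when the base family is 0xF.
inline std::uint32_t decodeFamily(std::uint32_t eax) {
    const std::uint32_t base = (eax >> 8) & 0xFu;
    const std::uint32_t ext = (eax >> 20) & 0xFFu;
    return (base == 0xFu) ? base + ext : base;
}

inline std::uint32_t getCpuidFamily() {
    std::uint32_t r[4];
    readCpuid(1, r);
    return decodeFamily(r[0]);
}
```

Keeping the bit decoding in `decodeFamily` separate from the CPUID read makes the family rule easy to check against known signatures, such as a family 17h part decoding to 0x17 and a family 15h part decoding to 0x15.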
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • "Number of cores per processor" & "Number of threads per processor" return incorrect values
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index

    #if 0 // BAD
        int x = p;                      // signed narrowing
        int count = 0;
        // while (x != 0) {             // never zero
        while (x > 0) {                 // false
            count += (x & 1);
            x >>= 1;                    // fill bits with sign
        }
        printf("popcnt = %i\n", count); // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b)); // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000]; // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: D3D12Multithreading, TitleThrottle=10000; x-axis: NumContexts (0 to 16); y-axis: cpu time difference from NumContexts=1 (-50 to 400, higher is better); series: Draws=1025, Draws=10250]
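The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a sketch only: the function name is hypothetical, and clamping the result to at least 1 is an assumption added here so that the value is always usable; the slide does not specify it.

```cpp
#include <algorithm>

// Rule of thumb from the slide: min(cores - 1, draws / 250).
// Assumption: never return fewer than one context.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = physicalCores > 1 ? physicalCores - 1 : 1;
    const unsigned byDraws = draws / 250;   // integer division
    return std::max(1u, std::min(byCores, byDraws));
}
```

For the 8-core test system, 1025 draws gives 4 contexts, while 10250 draws is limited by the core count and gives 7.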
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions
    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock xchg cmp
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
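The same test and test-and-set pattern can be sketched with standard C++ atomics instead of MSVC intrinsics. This is an illustrative variant and not the code from the talk: `SpinLock` and `SPIN_PAUSE` are hypothetical names, and unlocking with a release store (rather than an exchange, as in `Unlock` above) is a design choice made here.

```cpp
#include <atomic>
#if defined(_MSC_VER) || defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()      // emits the pause instruction
#else
#define SPIN_PAUSE() ((void)0)        // no-op on other architectures
#endif

// One lock per 64-byte cache line, as recommended by the summary.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};   // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load until the lock looks free.
            while (state.load(std::memory_order_relaxed) != 0) {
                SPIN_PAUSE();
            }
            // Test-and-set: a single exchange attempts to take the lock.
            if (state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
        }
    }

    // Non-blocking attempt; returns true when the lock was acquired.
    bool try_lock() {
        return state.load(std::memory_order_relaxed) == 0 &&
               state.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() { state.store(0, std::memory_order_release); }
};
```

Because the type provides `lock()` and `unlock()`, it can be used with `std::lock_guard<SpinLock>`. While the lock is held, waiting threads only read the cache line, so contended atomic writes happen only when the lock looks free.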
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Use Best Practices With Spinlocks; Elapsed Time (s) by binary (less is better)]
Before: 304 s
After: 27 s (+1018%)
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

[Figures: SOURCE BEFORE, SOURCE AFTER]
DISASSEMBLY SNIPPET

BEFORE (BAD)
    00000001400010C1:
        xor         eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp         eax,ecx
        je          00000001400010C1

AFTER (GOOD)
    00000001400010C0:
        cmp         dword ptr [?gLock@@3IA],1
        je          00000001400010D9
        mov         eax,1
        xchg        eax,dword ptr [?gLock@@3IA]
        cmp         eax,1
        jne         00000001400010DD
    00000001400010D9:
        pause
        jmp         00000001400010C0
    00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg

Preferred: instructions without lock prefix (the two xchg rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: call memcpy benchmark; x-axis: size (0 to 160); y-axis: Elapsed Time (s) (0 to 35, less is better); series: before, after]
CODE SAMPLE
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);   // size is unknown at compile time
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide's loop used i <= sizeof(a), which writes one byte past the array.
        for (size_t i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb  00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid Too Many Non-Temporal Streams; Elapsed Time (s) by binary (less is better)]
Before: 104 s
After: 47 s (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

[Figures: SOURCE BEFORE, SOURCE AFTER]
DISASSEMBLY SNIPPET

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0

AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
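The "align and pad" advice can be checked at compile time. The sketch below assumes a 64-byte cache line, matching the alignas(64) used throughout this deck; `kCacheLine`, `PackedCounter`, `PaddedCounter`, and `sameCacheLine` are hypothetical names introduced for illustration.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Unpadded: adjacent array elements share a cache line.
struct PackedCounter { unsigned long sum; };

// Padded: alignas rounds sizeof up to 64, so each element owns a line.
struct alignas(kCacheLine) PaddedCounter { unsigned long sum; };

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");
static_assert(alignof(PaddedCounter) == kCacheLine, "counter starts on a line boundary");

// True when two objects start in the same 64-byte line.
inline bool sameCacheLine(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

With per-thread data laid out as `PaddedCounter`, writes from one thread no longer invalidate the line holding another thread's counter. One caveat worth keeping in mind: heap arrays of over-aligned types created with `new` are only guaranteed to honor the alignment from C++17 onward.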
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid False Sharing; Time (s) by binary]
Before: 365 s
After: 47 s (+679%)
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, rand, srand
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0
// 4 bytes: adjacent ThreadData elements share a cache line (false sharing)
struct ThreadData { unsigned long sum; };
#else
// 64 bytes: each ThreadData element occupies its own cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    // Every iteration writes through p, so each thread stores to its slot repeatedly.
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX.
[AMD uProf screenshot. Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX.
[AMD uProf screenshot. Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h], rsi
lea r8, [?ThreadProcCallback@@YAKPEAX@Z]
mov r9, rbp
mov dword ptr [rsp+20h], esi
xor edx, edx
xor ecx, ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi*8], rax
add rbp, 40h
inc rdi
cmp rdi, r14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h], rsi
lea r8, [?ThreadProcCallback@@YAKPEAX@Z]
mov r9, rbp
mov dword ptr [rsp+20h], esi
xor edx, edx
xor ecx, ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi*8], rax
add rbp, 4
inc rdi
cmp rdi, r14
jb 0000000140001180
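The only difference between the two loops is the stride added to rbp, the pointer passed to each thread: 4 bytes in the bad case, 40h (64) bytes in the good case. A small sketch of what that stride means for cache-line sharing follows. The helper names are hypothetical, and it assumes the array starts on a 64-byte boundary, which the 4-byte layout does not guarantee.

```cpp
#include <cstddef>
#include <cstdio>

constexpr std::size_t kCacheLine = 64;  // cache line size in bytes

// Index of the cache line holding element i of an array whose elements are
// `stride` bytes apart. Assumes the array base is cache-line aligned.
constexpr std::size_t cacheLineOf(std::size_t i, std::size_t stride) {
    return (i * stride) / kCacheLine;
}

// Number of distinct cache lines covered by `count` consecutive elements.
constexpr std::size_t linesTouched(std::size_t count, std::size_t stride) {
    return count == 0 ? 0 : cacheLineOf(count - 1, stride) + 1;
}

// Print how many lines the per-thread slots occupy for both strides.
void printLayout(std::size_t threads) {
    std::printf("stride  4: %zu thread slots in %zu cache line(s)\n",
                threads, linesTouched(threads, 4));
    std::printf("stride 64: %zu thread slots in %zu cache line(s)\n",
                threads, linesTouched(threads, 64));
}
```

With 16 hardware threads, as on the Ryzen™ 7 2700X used for the test, a 4-byte stride puts all 16 slots in a single line under this assumption, while a 64-byte stride gives each thread its own line.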
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New “Zen 2” core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floydd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics 10, liquid good, Boralus Harbor)
DOTA 2 version 3359, gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359, gameplay 7.21B. Steam Beta Built Feb 15 2019, at 15:03:34.
Audacity/LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
“PINNACLE RIDGE” 8 CORE PROCESSOR
[Block diagram]
• Caches: 64K I-Cache (4-way), 32K D-Cache (8-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way)
• Core datapaths: 32B fetch, 2 x 16B load, 1 x 16B store
• Other blocks: Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 4.3 GHz / 3.7 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 615 MHz
“RAVEN RIDGE” 4 CORE PROCESSOR
[Block diagram]
• Caches: 64K I-Cache (4-way), 32K D-Cache (8-way), 512K L2 I+D Cache (8-way), 4M L3 I+D Cache (16-way)
• Core datapaths: 32B fetch, 2 x 16B load, 1 x 16B store
• Other blocks: Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, Media
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 3.9 GHz / 3.6 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 496 MHz
• sclk: 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Block diagram]
• Caches: 64K I-Cache (4-way), 32K D-Cache (8-way), 512K L2 I+D Cache (8-way), 8M L3 I+D Cache (16-way)
• Core datapaths: 32B fetch, 2 x 16B load, 1 x 16B store
• Other blocks: Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GMI
• Link widths shown: 32B/cycle and 16B/cycle on die; GMI 16B/cycle off die
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 4.4 GHz / 3.5 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA or UMA
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode.
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Diagram: Die 0 and Die 1, each with two CCXs, DDR and IO blocks; DDR4 Channels A, B, C and D; four groups of 16 PCIe® Gen 3 lanes; Infinity Fabric (∞) links.]
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA only
    • Windows 10 2019H1 Insider Preview has improved NUMA support.
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors.
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Diagram: Die 0, Die 1, Die 2 and Die 3, each with two CCXs; DDR and IO blocks; DDR4 Channels A, B, C and D; four groups of 16 PCIe® Gen 3 lanes; Infinity Fabric (∞) links between dies.]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC than previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs 19.2 | +64%). Cinebench R15 1t scores: “Zen” vs. “Piledriver” (139 vs 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE4.1 | SSE4.2 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | “Llano” | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
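Because FMA4, TBM and XOP are present on Family 15h but absent on Family 17h, code paths that use them are safer behind a runtime CPUID check than behind a vendor check. The sketch below only decodes register values that the caller has already read. The struct and function names are hypothetical, and the bit positions are given as I understand them from the AMD64 CPUID documentation; please verify them against the current manual.

```cpp
#include <cstdint>

// Hypothetical container for the AMD-specific feature flags discussed above.
struct AmdExtFeatures {
    bool xop;
    bool fma4;
    bool tbm;
    bool clzero;
};

// ecx_80000001: ECX returned by CPUID function 8000_0001h.
// ebx_80000008: EBX returned by CPUID function 8000_0008h.
// Bit positions are assumptions to be checked against the CPUID specification.
constexpr AmdExtFeatures decodeAmdExtFeatures(std::uint32_t ecx_80000001,
                                              std::uint32_t ebx_80000008) {
    return AmdExtFeatures{
        ((ecx_80000001 >> 11) & 1u) != 0,  // XOP
        ((ecx_80000001 >> 16) & 1u) != 0,  // FMA4
        ((ecx_80000001 >> 21) & 1u) != 0,  // TBM
        ((ebx_80000008 >> 0) & 1u) != 0,   // CLZERO
    };
}
```

The register values would typically come from the compiler's CPUID intrinsic (for example `__cpuid` in MSVC's `<intrin.h>`), after confirming that the extended function number is supported.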
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data.
  • Avoid mixing legacy SIMD and AVX instructions.
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction.
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction.
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS and RSQRTSS, which define relative error as:
  • |Relative Error| <= 1.5 * 2^-12
[Floating point unit diagram: Decode/Rename; 192 Entry Retire Queue; 4 fp micro-op dispatch, 8 micro-op retire; 36 Entry Scheduler; 160 entry Physical Register File; Forwarding Muxes; MUL0, ADD0, MUL1, ADD1; Load Convert Unit; 128 bit loads; Int to FP; Fp to Int]
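Since only the error bound is specified for the approximate reciprocal instructions, results from different processors are better compared with a tolerance than bit for bit. The sketch below is one way to do that; the function and constant names are hypothetical. If two results approximate the same exact value within the documented bound, they differ by at most twice that bound relative to the larger magnitude.

```cpp
#include <algorithm>
#include <cmath>

// Documented bound for RCPPS, RCPSS, RSQRTPS and RSQRTSS:
// |relative error| <= 1.5 * 2^-12.
constexpr double kApproxMaxRelError = 1.5 / 4096.0;

// True when a and b could both be in-spec approximations of the same exact
// value. This is a necessary condition only; it does not prove either is correct.
inline bool approxResultsAgree(double a, double b) {
    const double scale = std::max(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= 2.0 * kApproxMaxRelError * scale;
}
```

A regression test that compares captured output from two machines could use `approxResultsAgree` for values produced by these instructions, and exact comparison elsewhere.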
CLZERO INSTRUCTION
• AMD vendor specific instruction
  • An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status.
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory.
  • Example: kill a corrupt user process but keep the system running.
• Execution path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory.
• Is non-cacheable, unlike the PowerPC DCBZ instruction.
• Use memset rather than the CLZERO intrinsic to quickly zero memory.
[Load/store unit diagram: AGU0, AGU1; Load Queue; Store Queue; L1/L2 TLB; Micro-tags; Page Walker; L0 Pick, L1 Pick; TLB0, DAT0, TLB1, DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; To Ex; To FP; WCB; To L2; MAB; Store Pipe Pick; STP; Store Commit]
CACHE LATENCY
• Cache line size is 64 bytes.
  • 2 CPU clock cycles to move a single cache line.
• L2 is inclusive of L1.
  • Lines filled into L1 are also filled into L2.
• L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions.
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
Cache Latency in core clock cycles (less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
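The Sets column of the cache table follows from the other columns: sets = capacity / (ways * line size). The short sketch below recomputes it, along with the two-cycle line transfer implied by a 32B/cycle link. The type and constant names are hypothetical; the figures are the ones quoted on the slide.

```cpp
#include <cstddef>

// Geometry of one cache level, using the figures quoted on the slide.
struct CacheLevel {
    const char* name;
    std::size_t capacityBytes;
    std::size_t ways;
    std::size_t lineBytes;
};

// Number of sets in a set-associative cache.
constexpr std::size_t setsOf(const CacheLevel& c) {
    return c.capacityBytes / (c.ways * c.lineBytes);
}

// Cycles needed to move one cache line over a link of the given width.
constexpr std::size_t cyclesPerLine(std::size_t lineBytes, std::size_t bytesPerCycle) {
    return lineBytes / bytesPerCycle;
}

constexpr CacheLevel kL1I{"L1I", 64 * 1024, 4, 64};
constexpr CacheLevel kL1D{"L1D", 32 * 1024, 8, 64};
constexpr CacheLevel kL2U{"L2U", 512 * 1024, 8, 64};
constexpr CacheLevel kL3U{"L3U", 8 * 1024 * 1024, 16, 64};

static_assert(setsOf(kL1I) == 256, "matches the table");
static_assert(setsOf(kL1D) == 64, "matches the table");
static_assert(setsOf(kL2U) == 1024, "matches the table");
static_assert(setsOf(kL3U) == 8192, "matches the table");
static_assert(cyclesPerLine(64, 32) == 2, "64B line over a 32B/cycle link");
```

The same arithmetic holds for the uop cache if its capacity and line size are counted in uops: 2048 / (8 * 8) = 32 sets.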
REFILL FROM LOCAL DRAM
• ~64ns near memory (RP2-22)
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Diagram: two CCXs, each behind a CCM, attached to the SDF Transport Layer; two CS/UMC pairs, each driving a DDR4 channel; CAKE to IFIS or IFOP, off-chip]
REFILL FROM OTHER LOCAL CCX
• The cost of a refill from the other CCX may be similar to memory latency.
[Diagram: two CCXs, each behind a CCM, attached to the SDF Transport Layer; two CS/UMC pairs, each driving a DDR4 channel; CAKE to IFIS or IFOP, off-chip]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (RP2-22)
[Diagram: two dies, each with two CCXs behind CCMs, an SDF Transport Layer, two CS/UMC pairs driving DDR4 channels, and a CAKE; the dies are connected off-chip through IFIS or IFOP]
AMD μPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors).
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event-based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
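The derived columns in this profile are simple ratios of raw counter values. The following is a minimal sketch, assuming that PTI means events per thousand retired instructions (the expansion used on the spinlock and write-combining slides) and that IPC is retired instructions divided by CPU clocks. The struct and function names are hypothetical and are not part of any uProf API.

```cpp
#include <cstdint>

// Hypothetical container for raw counts taken from a profile; not a uProf type.
struct CounterSample {
    std::uint64_t cpuClocks;            // "CPU clocks"
    std::uint64_t retiredInstructions;  // "Ret inst"
    std::uint64_t events;               // e.g. ALUTokenStall or WcbFull occurrences
};

// Instructions per clock: retired instructions divided by CPU clocks.
inline double ipc(const CounterSample& s) {
    return s.cpuClocks ? static_cast<double>(s.retiredInstructions) / s.cpuClocks
                       : 0.0;
}

// Events per thousand retired instructions (PTI).
inline double perThousandInstructions(const CounterSample& s) {
    return s.retiredInstructions ? 1000.0 * s.events / s.retiredInstructions
                                 : 0.0;
}
```

Normalizing by retired instructions makes functions of different sizes comparable, which is why the later slides state their thresholds in PTI rather than in raw event counts.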
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L2
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access paths labeled in the diagram: rdpmc, SMN, in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED

AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE

USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes

LAME 3.100 WAV to MP3 Encode (less is better)
Configuration Platform | Elapsed Time (s) | Gain vs. Release x86
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

Example WinDbg output:
    0:000> g
    (79c.838): Illegal instruction - code c000001d (first chance)
    (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
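As an illustration of testing CPUID before choosing a code path, here is a minimal sketch that decodes three of the feature bits named on this slide: SSSE3 (leaf 1, ECX bit 9), FMA4 (leaf 0x80000001, ECX bit 16), and AVX512F (leaf 7 subleaf 0, EBX bit 16). The `CpuidRegs` struct and all function names are hypothetical. The sketch checks CPUID enumeration only; a complete check for AVX-class features would also confirm operating system support with XGETBV, which is omitted here.

```cpp
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Hypothetical holder for the four CPUID output registers.
struct CpuidRegs { std::uint32_t eax, ebx, ecx, edx; };

// Reads one CPUID leaf. Returns all zeros on non-x86 builds or when the
// leaf is above the maximum leaf the processor reports.
inline CpuidRegs readCpuid(std::uint32_t leaf, std::uint32_t subleaf) {
    CpuidRegs r = {0, 0, 0, 0};
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuid(regs, static_cast<int>(leaf & 0x80000000u));  // max basic/extended leaf
    if (static_cast<std::uint32_t>(regs[0]) >= leaf) {
        __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
        r.eax = regs[0]; r.ebx = regs[1]; r.ecx = regs[2]; r.edx = regs[3];
    }
#elif defined(__x86_64__) || defined(__i386__)
    // __get_cpuid_count leaves the outputs untouched if the leaf is unsupported.
    __get_cpuid_count(leaf, subleaf, &r.eax, &r.ebx, &r.ecx, &r.edx);
#else
    (void)leaf; (void)subleaf;
#endif
    return r;
}

// Pure decoders, kept separate from the read so they can be unit tested.
inline bool hasSsse3(const CpuidRegs& leaf1)          { return (leaf1.ecx >> 9) & 1u; }
inline bool hasFma4(const CpuidRegs& leaf80000001)    { return (leaf80000001.ecx >> 16) & 1u; }
inline bool hasAvx512f(const CpuidRegs& leaf7sub0)    { return (leaf7sub0.ebx >> 16) & 1u; }
```

A caller would evaluate something like `hasSsse3(readCpuid(1, 0))` once at startup and select the SSSE3 path only when it returns true, falling back to an SSE2 path otherwise.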
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            }
            else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                      // logical processors
    DWORD_PTR p = 0xffffffffffffffff;    // process mask
    int b = 32;                          // bit index

    #if 0 // BAD
        int x = p;                       // signed narrowing
        int count = 0;
        // while (x != 0) {              // never zero
        while (x > 0) {                  // false
            count += (x & 1);
            x >>= 1;                     // fill bits with sign
        }
        printf("popcnt = %i\n", count);  // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
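The same two operations can be wrapped in small helpers that never pass through a signed or 32-bit type. This is a minimal portable sketch, assuming `std::uint64_t` stands in for `DWORD_PTR` on 64-bit Windows; the function names are hypothetical.

```cpp
#include <cstdint>

// Counts the set bits of a 64-bit affinity mask without narrowing to int.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;  // unsigned shift: vacated bits are filled with zero
    }
    return count;
}

// Builds the single-bit mask for logical processor `index` within a group.
// Returns 0 for an out-of-range index instead of shifting by 64 or more,
// which would be undefined behavior.
inline std::uint64_t processorBit(unsigned index) {
    return (index < 64) ? (std::uint64_t{1} << index) : 0;
}
```

Keeping the mask unsigned and 64 bits wide end to end is what makes the loop terminate and the shift well defined on systems with more than 31 logical processors.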
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];  /* bad assumption */
    char* buffer = NULL;
    DWORD len = 0;
    // First call with a NULL buffer fails and reports the required length.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ... walk the returned records here
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores - 1, Draws / 250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: cpu time difference from NumContexts 1 (-50 to 400). Series: Draws 1025, Draws 10250.]
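The recommendation NumContexts = min(cores - 1, Draws / 250) can be written as a small helper. This is a minimal sketch under two assumptions that the slide does not state: "cores" is taken to mean physical cores, and the result is clamped to at least 1 so that small draw counts or single-core systems still get one context. The function name is hypothetical.

```cpp
#include <algorithm>

// Hypothetical helper applying the slide's rule of thumb,
// NumContexts = min(cores - 1, Draws / 250), clamped to a minimum of 1.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = (physicalCores > 1) ? physicalCores - 1 : 1;
    const unsigned byDraws = std::max(1u, draws / 250u);
    return std::min(byCores, byDraws);
}
```

For the 8-core test system above, this gives 4 contexts at 1025 draws (limited by draw count) and 7 contexts at 10250 draws (limited by core count).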
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            // Every spin iteration issues a lock cmpxchg.
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test with a plain read first; attempt the xchg only when free.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
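For readers not using the MSVC intrinsics, the same best practices (test and test-and-set, pause while spinning, one lock per 64-byte line) can be sketched with `std::atomic`. This is a hypothetical portable counterpart, not the code from the talk; it differs in that unlock uses a release store rather than an exchange, and it falls back to `std::this_thread::yield()` where `_mm_pause` is unavailable.

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// Hypothetical portable spin lock; alignas(64) keeps each lock on its own cache line.
struct alignas(64) SpinLock {
    std::atomic<bool> taken{false};

    bool try_lock() {
        // Test first with a plain load, then test-and-set only if the lock
        // looked free, so waiters do not hammer the line with atomic writes.
        return !taken.load(std::memory_order_relaxed) &&
               !taken.exchange(true, std::memory_order_acquire);
    }

    void lock() {
        while (!try_lock()) {
            SPIN_PAUSE();  // hint to the core that this is a spin-wait loop
        }
    }

    void unlock() {
        taken.store(false, std::memory_order_release);
    }
};
```

Because the type provides `lock`, `try_lock`, and `unlock`, it can be used with `std::lock_guard<SpinLock>`; whether it matches the measured gains on this slide would need to be confirmed by profiling.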
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Use Best Practices With Spinlocks (less is better)
binary | Elapsed Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>     // std::accumulate
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    // Each thread takes the lock, then multiplies LEN 4x4 matrices repeatedly.
    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE

SOURCE AFTER
DISASSEMBLY SNIPPET

BEFORE (BAD)
    00000001400010C1:
        xor eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp eax,ecx
        je 00000001400010C1

AFTER (GOOD)
    00000001400010C0:
        cmp dword ptr [?gLock@@3IA],1
        je 00000001400010D9
        mov eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp eax,1
        jne 00000001400010DD
    00000001400010D9:
        pause
        jmp 00000001400010C0
    00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg

Preferred: instructions without lock prefix (the xchg rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder

memcpy.asm line 456
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s) (0 to 35). Series: before, after.]
CODE SAMPLE

    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand, EXIT_SUCCESS
    #include <cstring>   // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Times `steps` calls to memcpy with a size unknown at compile time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide used i <= sizeof(a), which writes one byte past the array.
        for (size_t i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb 00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
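One way to keep a single write-combining stream open per hardware thread is to process the arrays one after another instead of interleaving their streaming stores in one loop. The sketch below illustrates that idea only; it is not the workaround measured in this talk, which replaces the second streaming store with a regular store. The function names are hypothetical, the data layout (four position floats followed by four velocity floats) is assumed to match the talk's sample, and whether the split loop is actually faster must be confirmed by profiling.

```cpp
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define HAVE_SSE_STREAM 1
#endif

// Advances one array laid out as [x y z w vx vy vz vw] per element group.
// `data` must be 16-byte aligned. Only one non-temporal stream is written.
inline void stepArray(float* data, std::size_t len, float dt) {
#if defined(HAVE_SSE_STREAM)
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i + 8 <= len; i += 8) {
        const __m128 p = _mm_load_ps(data + i);
        const __m128 v = _mm_load_ps(data + i + 4);
        _mm_stream_ps(data + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    _mm_sfence();  // order the streamed stores before the next pass starts
#else
    // Scalar fallback for builds without SSE; no streaming stores here.
    for (std::size_t i = 0; i + 8 <= len; i += 8)
        for (std::size_t k = 0; k < 4; ++k)
            data[i + k] += data[i + 4 + k] * dt;
#endif
}

// Runs the two arrays back to back so their streams are never interleaved.
inline void stepSplit(float* a, float* b, std::size_t len, float dt) {
    stepArray(a, len, dt);
    stepArray(b, len, dt);
}
```

The trade-off is a second pass of loop overhead and loads in exchange for write-combining buffers that are more likely to fill completely before they are closed.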
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Avoid Too Many Non-Temporal Streams (less is better)
binary | Elapsed Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, atof, srand, rand, EXIT_SUCCESS
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
  for (size_t i = 0; i < LEN; i += 8) {
    // x,y,z,w,vx,vy,vz,vw
    __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
    __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
    p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
    _mm_stream_ps(&a[(i + 0) % LEN], p1);
    __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
    __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
    p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
    _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store, only one non-temporal stream remains
    _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
  }
}

int main(int argc, char *argv[]) {
  using namespace std::chrono;
  int j = (argc > 1) ? atoi(argv[1]) : 0;
  int seed = (argc > 2) ? atoi(argv[2]) : 3;
  size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
  float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
  srand(seed);
  for (int i = 0; i < LEN; ++i) {
    a[i] = (float)rand() / RAND_MAX;
    b[i] = (float)rand() / RAND_MAX;
  }
  high_resolution_clock::time_point t0 = high_resolution_clock::now();
  for (size_t i = 0; i < steps; i++) {
    step(dt);
  }
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
  printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
  printf("a[%i] = %f\n", j, a[j]);
  printf("b[%i] = %f\n", j, b[j]);
  return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
[Source screenshots not included in this transcript]
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+4640h]
  movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+4640h]
  movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread-local rather than global or process-shared data
• Align and pad thread parameters – especially synchronization variables
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid False Sharing; y-axis: Time (s); x-axis: binary]
binary | Time (s)
Before | 365
After | 47 (+679%)
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, srand, rand, EXIT_SUCCESS
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: neighbouring ThreadData share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: one ThreadData per cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
  ThreadData* p = (ThreadData*)param;
  srand(seed);
  p->sum = 0;
  for (int i = 0; i < NUM_ITER; i++) {
    p->sum += rand() % 2;
  }
  return 0;
}

int main(int argc, char *argv[]) {
  seed = (argc > 1) ? atoi(argv[1]) : 3;
  int numThreads = std::thread::hardware_concurrency();
  HANDLE* threads = new HANDLE[numThreads];
  ThreadData* a = new ThreadData[numThreads];
  high_resolution_clock::time_point t0 = high_resolution_clock::now();
  for (size_t i = 0; i < numThreads; ++i) {
    threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
  }
  WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
  printf("time (ms): %lf\n", 1000.0 * time_span.count());
  for (size_t i = 0; i < numThreads; ++i) {
    printf("sum[%llu] = %lu\n", i, a[i].sum);
  }
  delete[] a;
  delete[] threads;
  return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
PROFILING AFTER
Few refills from CCX
[Profiler screenshots; columns shown in both: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
[Source screenshots not included in this transcript]
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
  mov qword ptr [rsp+28h],rsi
  lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov r9,rbp
  mov dword ptr [rsp+20h],esi
  xor edx,edx
  xor ecx,ecx
  call qword ptr [__imp_CreateThread]
  mov qword ptr [r15+rdi*8],rax
  add rbp,40h
  inc rdi
  cmp rdi,r14
  jb 0000000140001180
BEFORE (BAD)
0000000140001180:
  mov qword ptr [rsp+28h],rsi
  lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov r9,rbp
  mov dword ptr [rsp+20h],esi
  xor edx,edx
  xor ecx,ecx
  call qword ptr [__imp_CreateThread]
  mov qword ptr [r12+rdi*8],rax
  add rbp,4
  inc rdi
  cmp rdi,r14
  jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 1911 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359, gameplay 7.21B, Steam Beta Built Feb 15 2019 at 150334
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
"RAVEN RIDGE" 4 CORE PROCESSOR
[Block diagram; recoverable labels listed below]
• 64K I-Cache, 4-way; 32B fetch
• 32K D-Cache, 8-way; 2x16B load, 1x16B store
• 512K L2 I+D Cache, 8-way
• 4M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GFX9, Media
• Link widths shown: 32B/cycle and 16B/cycle
• Clock domains shown: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 3.9 GHz | 3.6 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 496 MHz
• sclk: 1.25 GHz (11CU = 704 shaders)
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Block diagram; recoverable labels listed below]
• 64K I-Cache, 4-way; 32B fetch
• 32K D-Cache, 8-way; 2x16B load, 1x16B store
• 512K L2 I+D Cache, 8-way
• 8M L3 I+D Cache, 16-way
• Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GMI
• Link widths shown: 32B/cycle and 16B/cycle on die; 16B/cycle off die (GMI)
• Clock domains shown: cclk, l3clk, fclk, uclk, memclk, lclk
• cclk: 4.4 GHz | 3.5 GHz
• fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
• lclk: 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR Channels
• ~64ns near memory
• ~105ns far memory
• NUMA or UMA
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200: measured at approximately 50GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Diagram: Die 0 and Die 1, with CCX, DDR and IO blocks; DDR4 Channels A, B, C and D; four groups of 16 PCIe® Gen 3 lanes; ∞ (Infinity Fabric) links]
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR Channels
• ~64ns near memory
• ~105ns far memory
• NUMA only
• Windows 10 2019H1 Insider Preview has improved NUMA support
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200: measured at approximately 25GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Diagram: Die 0, Die 1, Die 2 and Die 3, with CCX, DDR and IO blocks; DDR4 Channels A, B, C and D; four groups of 16 PCIe® Gen 3 lanes; ∞ (Infinity Fabric) links between dies]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen") / 8GB DDR3-2133 ("Excavator") / 8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the "Zen" architecture vs. "Piledriver" architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 –O2 at a fixed 3.4GHz. Generational IPC uplift for the "Zen" architecture vs. "Excavator" architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 –O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen") / 8GB DDR3-2133 ("Excavator") / 8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: "Zen" vs. "Piledriver" (31.5 vs. 20.7 | +52%), "Zen" vs. "Excavator" (31.5 vs. 19.2 | +64%). Cinebench R15 1t scores: "Zen" vs. "Piledriver" (139 vs. 79, both at 3.4G | +76%), "Zen" vs. "Excavator" (160 vs. 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT
2017 | 17h | 00h-0Fh | "Summit Ridge", "Pinnacle Ridge" | Ryzen™ 7 2700X
2015 | 15h | 60h-6Fh | "Carrizo", "Bristol Ridge" | A12-9800
2014 | 15h | 30h-3Fh | "Kaveri", "Godavari" | A10-7890K
2012 | 15h | 00h-0Fh | "Vishera" | FX-8370
2011 | 15h | 00h-0Fh | "Zambezi" | FX-8150
2013 | 16h | 00h-0Fh | "Kabini" | A6-1450
2011 | 14h | 00h-0Fh | "Ontario" | E-450
2011 | 12h | | "Llano" | A8-3870
2009 | 10h | | "Greyhound" | Phenom II X4 955
Instruction support by example product (1 = supported, 0 = not supported):
INSTRUCTIONS | 2700X | A12-9800 | A10-7890K | FX-8370 | FX-8150 | A6-1450 | E-450 | A8-3870 | Phenom II X4 955
ADX, CLFLUSHOPT, RDSEED, SHA, SMAP, XGETBV, XSAVEC, XSAVES | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
AVX2, BMI2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
MOVBE | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0
RDRND, SMEP | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FSGSBASE | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0
XSAVEOPT | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0
BMI | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0
FMA | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
F16C | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 0
AES, AVX, OSXSAVE, PCLMULQDQ, SSE4.1, SSE4.2, XSAVE | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
SSSE3 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0
CLZERO | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FMA4 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
TBM | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0
XOP | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS and RSQRTSS, which define relative error as
  • |Relative Error| <= 1.5 * 2^-12
[Diagram labels: Decode/Rename; 192 Entry Retire Queue; 4 fp micro-op dispatch; 8 micro-op retire; 36 Entry Scheduler; 160 entry Physical Register File; Forwarding Muxes; MUL0, ADD0, MUL1, ADD1; Load Convert Unit; 128 bit loads; Int to FP; Fp to Int]
CLZERO INSTRUCTION
• AMD vendor specific instruction
• An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution Path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Diagram labels: AGU0, AGU1; To Ex; To FP; Load Queue; Store Queue; L1/L2 TLB; Micro-tags; Page Walker; L0 Pick; L1 Pick; TLB0, DAT0, TLB1, DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; WCB; To L2; MAB; Store Pipe Pick; STP; Store Commit]
CACHE LATENCY
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
[Chart: Cache Latency (less is better); y-axis: core clock cycles]
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
REFILL FROM LOCAL DRAM
• ~64ns near memory (see footnote RP2-22)
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Diagram: two CCX, each with CCM; SDF Transport Layer; CS and UMC to each DDR4 channel; CAKE to IFIS or IFOP (off-chip)]
REFILL FROM OTHER LOCAL CCX
• Refill cost from the other CCX may be similar to memory latency
[Diagram: two CCX, each with CCM; SDF Transport Layer; CS and UMC to each DDR4 channel; CAKE to IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory
• See footnote RP2-22 on the REFILL FROM LOCAL DRAM slide for the latency measurements and system configuration
[Diagram: two dies, each with blocks DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, connected through off-chip IFIS or IFOP]
AMD μPROF PROFILER

V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.

Threads  Time (s)
0        23
1        29
2        6
3        3
4        4
5        4
6        4
7        3
8        2
9        2
10       1
11       1
12       1
13       1
14       1
15       5
16       11
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
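Several of the metrics above are derived values: IPC is retired instructions divided by CPU clocks, and PTI means "per thousand instructions". The following is a minimal sketch of that arithmetic; the `CounterTotals` struct and function names are hypothetical and are not part of the AMD μProf interface.

```cpp
#include <cstdint>

// Hypothetical container for raw counter totals collected over a sample
// window. Field names are illustrative only.
struct CounterTotals {
    std::uint64_t cpu_clocks;            // unhalted core clock cycles
    std::uint64_t retired_instructions;  // retired instruction count
    std::uint64_t events;                // any event of interest, e.g. stalls
};

// Instructions per cycle; returns 0 when no cycles were recorded.
inline double ipc(const CounterTotals& t) {
    return t.cpu_clocks == 0
        ? 0.0
        : static_cast<double>(t.retired_instructions) /
          static_cast<double>(t.cpu_clocks);
}

// Event count normalized per thousand retired instructions (PTI).
inline double per_thousand_instructions(const CounterTotals& t) {
    return t.retired_instructions == 0
        ? 0.0
        : 1000.0 * static_cast<double>(t.events) /
          static_cast<double>(t.retired_instructions);
}
```

Normalizing by instruction count makes functions of different sizes and run times comparable, which is why thresholds later in the deck are quoted in PTI.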
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L2
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
[Diagram labels: rdpmc, SMN, in/out]
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: SMU may still reduce frequency if application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED

AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE

USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors

Year (AMD Product Family, implicit): Visual Studio Changes
• 2019 ("Pinnacle Ridge"): Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations.
• 2017 ("Summit Ridge"): Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option.
• 2015 ("Kaveri", "Godavari"): Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2.
• 2013 ("Vishera"): Improved inline. Improved auto-vectorization. Improved ISO C99 language and library.
• 2012 ("Bulldozer"): Added autovectorization. Optimized container memory sizes.
• 2010: Added nullptr keyword. Replaced VCBuild with MSBuild.
• 2008 ("Greyhound"): Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds.
• 2005 ("K8"): Added x64 native compiler.
COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes

LAME 3.100 WAV to MP3 Encode (less is better)
Configuration Platform   Elapsed Time (s)   vs. Release x86
Release x86              52
ReleaseSSE2 x86          44                 +18%
Release x64              43                 +21%
ReleaseSSE2 x64          39                 +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
• ISPC: https://github.com/ispc/ispc
  • x86-x64 issue fixed November 21, 2018
  • CPU_Pentium4 issue fixed July 12, 2018
• Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
  • pabsd issue fixed November 2nd, 2018
• Bullet3: https://github.com/bulletphysics/bullet3
  • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4      pabsd   xmm2,xmm4
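The advice above can be sketched as a runtime feature check performed before dispatching to an SSSE3 or FMA4 code path. This is a minimal illustration for x86 targets only; the `CpuidRegs` struct and the helper names are hypothetical. It relies on `__cpuidex` (MSVC) or `__cpuid_count` (GCC/Clang `cpuid.h`), and on the documented feature bits CPUID Fn0000_0001 ECX[9] (SSSE3) and CPUID Fn8000_0001 ECX[16] (FMA4).

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Hypothetical wrapper type for the four CPUID result registers.
struct CpuidRegs { std::uint32_t eax, ebx, ecx, edx; };

inline CpuidRegs cpuid(std::uint32_t leaf, std::uint32_t subleaf = 0) {
#if defined(_MSC_VER)
    int r[4];
    __cpuidex(r, static_cast<int>(leaf), static_cast<int>(subleaf));
    return { static_cast<std::uint32_t>(r[0]), static_cast<std::uint32_t>(r[1]),
             static_cast<std::uint32_t>(r[2]), static_cast<std::uint32_t>(r[3]) };
#else
    CpuidRegs r{};
    __cpuid_count(leaf, subleaf, r.eax, r.ebx, r.ecx, r.edx);
    return r;
#endif
}

// Vendor string, e.g. "AuthenticAMD"; register order is EBX, EDX, ECX.
inline std::string cpu_vendor() {
    const CpuidRegs r = cpuid(0);
    char v[12];
    std::memcpy(v + 0, &r.ebx, 4);
    std::memcpy(v + 4, &r.edx, 4);
    std::memcpy(v + 8, &r.ecx, 4);
    return std::string(v, sizeof(v));
}

// SSSE3 (needed for pabsd): standard leaf 1, ECX bit 9.
inline bool has_ssse3() {
    if (cpuid(0).eax < 1) return false;
    return ((cpuid(1).ecx >> 9) & 1u) != 0;
}

// FMA4: extended leaf 0x80000001, ECX bit 16. A full check for FMA4 or AVX
// code paths would also confirm OS support for YMM state (OSXSAVE + XGETBV);
// that step is omitted in this sketch.
inline bool has_fma4() {
    if (cpuid(0x80000000u).eax < 0x80000001u) return false;
    return ((cpuid(0x80000001u).ecx >> 16) & 1u) != 0;
}
```

A caller would select the SSSE3 implementation only when `has_ssse3()` returns true and otherwise fall back to an SSE2 path, which avoids the illegal instruction exception shown in the debugger output.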
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

// This advice is specific to AMD processors and is
// not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        } else {
            count = cores;
        }
    }
    return count;
}
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

DWORD lps = 64;                      // logical processors
DWORD_PTR p = 0xffffffffffffffff;    // process mask
int b = 32;                          // bit index

#if 0 // BAD
int x = p;                           // signed narrowing
int count = 0;
// while (x != 0) {                  // never zero
while (x > 0) {                      // false
    count += (x & 1);
    x >>= 1;                         // fill bits with sign
}
printf("popcnt = %i\n", count);      // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

// char buffer[0x1000];  // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll,
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                &len)) {
            // ...
        }
    }
}
free(buffer);
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
• See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: cpu time difference from NumContexts 1. Series: Draws 1025, Draws 10250.]
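The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small helper. The function name is hypothetical, and clamping the result to at least 1 is an added assumption (the slide does not say what to do when the formula yields zero).

```cpp
#include <algorithm>

// Hypothetical helper implementing NumContexts = min(cores - 1, Draws / 250).
// Assumption not stated on the slide: the result is clamped to a minimum of 1
// so that a small draw count or a single-core system still gets one context.
inline int recommended_num_contexts(int physical_cores, int draws) {
    const int by_cores = physical_cores - 1;   // leave one core for the main thread
    const int by_draws = draws / 250;          // integer division
    return std::max(1, std::min(by_cores, by_draws));
}
```

For example, with 8 physical cores the helper returns 4 for 1025 draws and 7 for 10250 draws. As with thread counts in general, profiling the actual workload remains the deciding factor.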
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

namespace MyLock {
typedef unsigned LOCK, *PLOCK;
enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

#if 0 // BAD
void Lock(PLOCK pl) {
    while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
        pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
    }
}
#else // GOOD
void Lock(PLOCK pl) {
    while ((LOCK_IS_TAKEN == *pl) ||
           (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
        _mm_pause();
    }
}
#endif
void Unlock(PLOCK pl) {
    _InterlockedExchange(pl, LOCK_IS_FREE);
}
}

alignas(64) MyLock::LOCK gLock;
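For code bases that prefer standard C++ over compiler intrinsics, the same test and test-and-set pattern might be expressed with `std::atomic`. This is a sketch under stated assumptions, not the presenter's implementation: the `SpinLock` class and `cpu_relax` helper are hypothetical names, and unlock uses a release store where the slide uses `_InterlockedExchange`.

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
inline void cpu_relax() { _mm_pause(); }   // pause instruction on x86
#else
inline void cpu_relax() {}                 // no-op fallback on other targets
#endif

// Hypothetical spinlock; alignas(64) keeps the lock on its own cache line.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiters do not issue
            // exchange operations while the lock is held.
            while (taken_.load(std::memory_order_relaxed)) {
                cpu_relax();
            }
            // Test-and-set: attempt the exchange only when the lock looked free.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    bool try_lock() {
        return !taken_.load(std::memory_order_relaxed) &&
               !taken_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        taken_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> taken_{false};
};

static_assert(alignof(SpinLock) == 64, "lock should occupy its own cache line");
```

The class satisfies the standard Lockable requirements, so it can be used with `std::lock_guard<SpinLock>`. The instructions actually generated depend on the compiler and flags, so inspecting the disassembly is still worthwhile.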
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Use Best Practices With Spinlocks (less is better)
binary   Elapsed Time (s)
Before   304
After    27                 +1018%
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "stdlib.h"
#include "windows.h"
#include <algorithm>
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();

    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);

    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());

    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
[Profiler screenshot]

PROFILING AFTER
~0 stalls per 1000 instructions
[Profiler screenshot]

SOURCE BEFORE
[Source view screenshot]

SOURCE AFTER
[Source view screenshot]
DISASSEMBLY SNIPPET

BEFORE (BAD)
00000001400010C1:
    xor     eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp     eax,ecx
    je      00000001400010C1

AFTER (GOOD)
00000001400010C0:
    cmp     dword ptr [?gLock@@3IA],1
    je      00000001400010D9
    mov     eax,1
    xchg    eax,dword ptr [?gLock@@3IA]
    cmp     eax,1
    jne     00000001400010DD
00000001400010D9:
    pause
    jmp     00000001400010C0
00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                            generates instructions including
_InterlockedAnd                      lock cmpxchg
_interlockedbittestandreset          lock btr
_interlockedbittestandset            lock bts
_InterlockedCompareExchange          lock cmpxchg
_InterlockedCompareExchange128       lock cmpxchg16b
_InterlockedCompareExchangePointer   lock cmpxchg
_InterlockedDecrement                lock dec
_InterlockedExchangeAdd              lock add
_InterlockedIncrement                lock inc
_InterlockedOr                       lock cmpxchg
_InterlockedXor                      lock cmpxchg
_InterlockedExchange                 xchg
_InterlockedExchangePointer          xchg

Preferred: instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

; memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]         ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

; memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                   ; is this a small set, size <= 128?
    ja      XmmSet                    ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: "call memcpy benchmark" (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s). Series: before, after.]
CODE SAMPLE
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <numeric>

    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Time `steps` calls to memcpy with a length that is only known at run time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char *argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Fill the source buffer with pseudo-random bytes.
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));

        work(size, steps);

        // Print one element of each buffer so the copy is not optimized away.
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
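The "per thousand instructions" (PTI) metric used in the profiling guidance is a simple ratio. The sketch below shows one way to compute and classify it from raw counter values; the function names and the `Severity` enum are hypothetical, and the thresholds are passed in by the caller rather than hard-coded, since they should be taken from the profiler guidance being followed.

```cpp
#include <cstdint>

// Events per thousand retired instructions (PTI).
// Returns 0 when no instructions were retired, to avoid dividing by zero.
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    if (retired_instructions == 0) return 0.0;
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}

// Hypothetical severity buckets for a hot function.
enum class Severity { Ok, Bad, VeryBad };

// Classify a PTI value against caller-supplied "bad" and "very bad" thresholds.
inline Severity classify_pti(double pti, double bad_threshold,
                             double very_bad_threshold) {
    if (pti >= very_bad_threshold) return Severity::VeryBad;
    if (pti >= bad_threshold) return Severity::Bad;
    return Severity::Ok;
}
```

Normalizing by retired instructions makes the counter comparable across functions that run for very different amounts of time.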
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
binary | Time (s)
Before | 104
After | 47
Improvement: +123%
[Chart: "Avoid Too Many Non-Temporal Streams" (less is better). X axis: binary. Y axis: Elapsed Time (s).]
CODE SAMPLE
    #include <intrin.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <numeric>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // layout per element: x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);   // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0
            // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else
            // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char *argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups   xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
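The "align and pad" advice can be checked at compile time. The following is a minimal sketch, assuming the 64-byte cache line size quoted elsewhere in this deck; `PaddedCounter`, `cache_line_of`, and `on_distinct_lines` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size (64 bytes, as stated for these processors).
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot. alignas rounds both the alignment and the
// size of the struct up to a full cache line, so adjacent array elements
// never share a line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long sum = 0;
};

static_assert(alignof(PaddedCounter) == kCacheLine, "slot must be line-aligned");
static_assert(sizeof(PaddedCounter) == kCacheLine, "slot must fill one line");

// Index of the cache line that holds an address.
inline std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
}

// True when two objects start on different cache lines.
inline bool on_distinct_lines(const void* a, const void* b) {
    return cache_line_of(a) != cache_line_of(b);
}
```

With an array of `PaddedCounter`, each thread can update its own element without invalidating the line another thread is writing; the cost is 64 bytes per slot instead of 4 or 8.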
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
binary | Time (s)
Before | 365
After | 47
Improvement: +679%
[Chart: "Avoid False Sharing". X axis: binary. Y axis: Time (s).]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0
    // 4 bytes: adjacent elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else
    // 64 bytes: each element occupies its own cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void *param) {
        ThreadData *p = (ThreadData *)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char *argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE *threads = new HANDLE[numThreads];
        ThreadData *a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void *)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Profiler columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX
[Profiler columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180

BEFORE (BAD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
[Block diagram. Recoverable labels:]
• Caches: 64K I-Cache, 4-way; 32K D-Cache, 8-way; 512K L2 I+D Cache, 8-way; 8M L3 I+D Cache, 16-way
• Core-side data paths: 32B fetch; 2*16B load; 1*16B store; 32B/cycle links between cache levels
• Fabric-side blocks: Data Fabric, Unified Memory Controller, DRAM Channel, IO Hub Controller, GMI
• Fabric-side data paths: 32B/cycle and 16B/cycle links; GMI 16B/cycle off die
• Clock domains: cclk, l3clk, fclk, uclk, memclk, lclk
  • cclk: 4.4 GHz | 3.5 GHz
  • fclk = uclk = memclk: 1.47 GHz (DDR4-2933)
  • lclk: 727 MHz
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA or UMA
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Diagram labels: Die 0, Die 1; CCX (x4); DDR (x2); IO (x2); DDR4 Channels A, B, C, D; 16 PCIe® Gen 3 Lanes (x4); Infinity Fabric links (∞)]
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA only
    • Windows 10 2019H1 Insider Preview has improved NUMA support
  • The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Diagram labels: Die 0, Die 1, Die 2, Die 3; CCX (x8); DDR (x2); IO (x2); DDR4 Channels A, B, C, D; 16 PCIe® Gen 3 Lanes (x4); Infinity Fabric links (∞)]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC than previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen") / 8GB DDR3-2133 ("Excavator") / 8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the "Zen" architecture vs. "Piledriver" architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the "Zen" architecture vs. "Excavator" architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen") / 8GB DDR3-2133 ("Excavator") / 8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: "Zen" vs. "Piledriver" (31.5 vs. 20.7 | +52%), "Zen" vs. "Excavator" (31.5 vs. 19.2 | +64%). Cinebench R15 1T scores: "Zen" vs. "Piledriver" (139 vs. 79, both at 3.4G | +76%), "Zen" vs. "Excavator" (160 vs. 97.5, both at 4.0G | +64%). RZN-11
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.

YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE41 | SSE42 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | "Summit Ridge", "Pinnacle Ridge" | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | "Carrizo", "Bristol Ridge" | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | "Kaveri", "Godavari" | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | "Vishera" | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | "Zambezi" | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | "Kabini" | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | "Ontario" | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | "Llano" | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | "Greyhound" | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
bull There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero databull Avoid mixing legacy SIMD and AVX instructionsbull Zero the upper 128 bits of all YMM registers before
executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
bull Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
bull Results from x86 processors from different vendors may not match exactly for instructions RCPPS RCPSS RSQRTPS RSQRTSS which define relative error asbull |Relative Error| lt= 152^-12
FLOAT INSTRUCTIONS
36 Entry Scheduler
160 entry Physical Register File
MUL0
ADD0
MUL1
ADD1
Load Convert Unit
Forwarding Muxes
DecodeRename192 Entry
Retire Queue
4 fp micro-op dispatch8 micro-op retire
128 bit loads
Int to FP
Fp to Int
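The relative-error bound above suggests comparing reciprocal and reciprocal-square-root approximations against a reference with a tolerance rather than with exact equality. The following is a minimal sketch of that idea; the helper name `withinApproxBound` and the constant name are hypothetical and not part of any AMD or compiler API.

```cpp
#include <cassert>
#include <cmath>

// Bound quoted on the slide for RCPPS/RCPSS/RSQRTPS/RSQRTSS:
// |relative error| <= 1.5 * 2^-12 (2^12 == 4096).
constexpr double kApproxRelErrorBound = 1.5 / 4096.0;

// Hypothetical helper: true when `approx` is within the bound of the
// exactly computed reference value `exact`.
inline bool withinApproxBound(double approx, double exact) {
    if (exact == 0.0) {
        return approx == 0.0;  // relative error is undefined at zero
    }
    return std::fabs((approx - exact) / exact) <= kApproxRelErrorBound;
}
```

A regression test that validates each vendor's output against an exactly computed reference in this way should tolerate the small differences that the bound permits.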
CLZERO INSTRUCTION
• AMD vendor-specific instruction.
  • An atomic MOVNT (streaming store) type instruction that writes an entire 64 B cache line full of zeros to memory and clears poisoned status.
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory.
  • Example: kill a corrupt user process but keep the system running.
• Execution path: Store Queue -> Store Commit -> Write Combining Buffer -> L2 -> Data Fabric -> Memory.
• Is non-cacheable, unlike the PowerPC DCBZ instruction.
• Use memset rather than the CLZERO intrinsic to quickly zero memory.

[Figure: load/store unit block diagram. Load Queue, Store Queue, L1/L2 TLB, Micro-tags, Page Walker, L0 Pick, L1 Pick, TLB0, DAT0, TLB1, DAT1, 32K Data Cache, Prefetch, 32 bytes to/from L2, AGU0, AGU1, To Ex, To FP, WCB, To L2, MAB, Store Pipe Pick, STP, Store Commit]
CACHE LATENCY
• Cache line size is 64 bytes.
  • 2 CPU clock cycles to move a single cache line.
• L2 is inclusive of L1.
  • Lines filled into L1 are also filled into L2.
• L3 is filled from L2 victims of all 4 cores within a CCX.
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™.
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions.

Level  Count  Capacity  Sets  Ways  Line Size  Latency
uop    8      2 K uops  32    8     8 uops     N/A
L1I    8      64 KB     256   4     64 B       4 clocks
L1D    8      32 KB     64    8     64 B       4 clocks
L2U    8      512 KB    1024  8     64 B       12 clocks
L3U    2      8 MB      8192  16    64 B       35 clocks

Cache Latency in core clock cycles (less is better):

Processor             L1D  L2U  L3U
AMD Ryzen™ 7 1800X    4    17   40
AMD Ryzen™ 7 2700X    4    12   35
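As a cross-check of the geometry table, each capacity equals sets x ways x line size. The sketch below verifies this at compile time; the struct and constant names are illustrative only, and the 32 bytes per clock figure is taken from the "32 bytes to/from L2" label in the load/store diagram.

```cpp
#include <cstddef>

// Illustrative description of one cache level (names are hypothetical).
struct CacheGeometry {
    std::size_t sets;
    std::size_t ways;
    std::size_t lineBytes;
};

// Capacity in bytes = sets * ways * line size.
constexpr std::size_t capacityBytes(const CacheGeometry& g) {
    return g.sets * g.ways * g.lineBytes;
}

// Values copied from the table above.
constexpr CacheGeometry kL1I{256, 4, 64};
constexpr CacheGeometry kL1D{64, 8, 64};
constexpr CacheGeometry kL2{1024, 8, 64};
constexpr CacheGeometry kL3{8192, 16, 64};

static_assert(capacityBytes(kL1I) == 64 * 1024, "L1I is 64 KB");
static_assert(capacityBytes(kL1D) == 32 * 1024, "L1D is 32 KB");
static_assert(capacityBytes(kL2) == 512 * 1024, "L2 is 512 KB");
static_assert(capacityBytes(kL3) == 8 * 1024 * 1024, "L3 is 8 MB");

// A 64-byte line moved at 32 bytes per clock takes 2 clocks.
constexpr std::size_t kBytesPerClock = 32;
static_assert(kL1D.lineBytes / kBytesPerClock == 2, "2 clocks per line");
```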
REFILL FROM LOCAL DRAM
• ~64 ns near memory.
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64 ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105 ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25 GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22

Diagram abbreviations:
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller

[Figure: die block diagram showing DDR4, UMC, CS, CCM, CCX, CAKE, the SDF Transport Layer, and off-chip IFIS or IFOP links]
REFILL FROM OTHER LOCAL CCX
• The cost of a refill from the other CCX may be similar to memory latency.

[Figure: die block diagram showing DDR4, UMC, CS, CCM, CCX, CAKE, the SDF Transport Layer, and off-chip IFIS or IFOP links; abbreviations as listed under REFILL FROM LOCAL DRAM]
REFILL FROM REMOTE DIE DRAM
• ~105 ns far memory.
• See footnote RP2-22 under REFILL FROM LOCAL DRAM for the measurement details and system configuration.

[Figure: two-die block diagram; each die shows DDR4, UMC, CS, CCM, CCX, CAKE, and the SDF Transport Layer, connected through off-chip IFIS or IFOP links]
AMD μPROF PROFILER

V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Support for multiple counters that use the same event with different unit masks
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors).
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.

Threads  Time (s)
0        23
1        29
2        6
3        3
4        4
5        4
6        4
7        3
8        2
9        2
10       1
11       1
12       1
13       1
14       1
15       5
16       11
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling (PTI = Per Thousand Instructions):
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access methods labeled in the diagram: rdpmc, SMN, in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing.
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on "Raven Ridge" processors.
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits.
OPTIMIZATIONS AND LESSONS LEARNED

AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE

USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors.

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops. Support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
COMPILE FOR PLATFORM X64
• Binary compiled for the x64 platform shows higher performance.
  • The ReleaseSSE2 configuration uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448 MB, 44.4 minutes.

LAME 3.100 WAV to MP3 Encode, elapsed time in seconds (less is better):

Configuration | Platform  Elapsed Time (s)  Improvement
Release | x86              52
ReleaseSSE2 | x86          44                +18%
Release | x64              43                +21%
ReleaseSSE2 | x64          39                +33%
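The percentages on the chart appear to be computed as (baseline time / new time - 1) x 100, rounded to the nearest integer. This is an inference from the numbers, not something stated on the slide. A minimal sketch of that arithmetic, with a hypothetical helper name:

```cpp
#include <cmath>

// Hypothetical helper: percentage improvement of `afterSeconds` relative to
// `beforeSeconds`, computed as (before / after - 1) * 100 and rounded.
inline long speedupPercent(double beforeSeconds, double afterSeconds) {
    return std::lround((beforeSeconds / afterSeconds - 1.0) * 100.0);
}
```

With the chart values, `speedupPercent(52, 44)`, `speedupPercent(52, 43)`, and `speedupPercent(52, 39)` reproduce the +18, +21, and +33 labels.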
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

Example WinDbg output:

    0:000> g
    (79c838) Illegal instruction - code c000001d (first chance)
    (79c838) Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
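One way to follow this advice is to query CPUID once at startup and select a code path from the result. The sketch below checks the SSSE3 feature bit (CPUID function 1, ECX bit 9). It assumes MSVC's `__cpuid` from `<intrin.h>` or GCC/Clang's `__get_cpuid` from `<cpuid.h>`; the function names `ssse3FromEcx`, `cpuidLeaf1Ecx`, `cpuHasSsse3`, and `selectAbsKernel` are hypothetical.

```cpp
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#endif

// CPUID function 1 reports SSSE3 in ECX bit 9.
constexpr bool ssse3FromEcx(std::uint32_t ecx) {
    return ((ecx >> 9) & 1u) != 0;
}

// Returns ECX of CPUID function 1, or 0 when this build has no CPUID access
// (for example on a non-x86 target), so every optional feature reads as absent.
inline std::uint32_t cpuidLeaf1Ecx() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    return static_cast<std::uint32_t>(regs[2]);
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    return ecx;
#else
    return 0;
#endif
}

inline bool cpuHasSsse3() {
    return ssse3FromEcx(cpuidLeaf1Ecx());
}

// Hypothetical dispatch: only choose the pabsd-based path when SSSE3 is
// enumerated; otherwise fall back to a scalar implementation.
inline const char* selectAbsKernel() {
    return cpuHasSsse3() ? "ssse3" : "scalar";
}
```

The same pattern applies to the other extensions listed above. Each has its own CPUID function and bit, and AVX-class features additionally require checking OS support through XGETBV.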
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD "Bulldozer" is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • "Number of cores per processor" & "Number of threads per processor" return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:

    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:

    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions.
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                    // logical processors
    DWORD_PTR p = 0xffffffffffffffff;  // process mask
    int b = 32;                        // bit index

    #if 0 // BAD
    int x = p;                         // signed narrowing
    int count = 0;
    // while (x != 0) {                // never zero
    while (x > 0) {                    // false
        count += (x & 1);
        x >>= 1;                       // fill bits with sign
    }
    printf("popcnt = %i\n", count);                    // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));   // undefined, 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
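The same rules can be packaged as small helpers that keep the mask unsigned and 64 bits wide from end to end. This is a portable sketch that uses `std::uint64_t` in place of the Windows `DWORD_PTR` type; the helper names are hypothetical.

```cpp
#include <cstdint>

// Counts the set bits of a 64-bit affinity mask. The loop variable stays
// unsigned, so right shifts fill with zeros and the loop always terminates.
inline int countLogicalProcessors(std::uint64_t affinityMask) {
    int count = 0;
    for (std::uint64_t x = affinityMask; x != 0; x >>= 1) {
        count += static_cast<int>(x & 1u);
    }
    return count;
}

// Builds the single-bit mask for one logical processor. The shift is done on
// a 64-bit unsigned value; out-of-range indices return 0 instead of shifting
// by 64 or more, which would be undefined.
inline std::uint64_t maskForProcessor(unsigned bitIndex) {
    return bitIndex < 64 ? (std::uint64_t{1} << bitIndex) : 0;
}
```

For a full 64-processor mask, `countLogicalProcessors` returns 64, and `maskForProcessor(32)` yields 0x0000000100000000, matching the comment in the GOOD branch above.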
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg commands:
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\n\",poi(r8)"

    // char buffer[0x1000];  // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // The first call is expected to fail and report the required length.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (buffer != NULL && GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ... use the processor information in buffer ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or insufficient buffers.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Figure: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts, 0 to 16. Y axis: cpu time difference from NumContexts 1, -50 to 400. Series: Draws 1025 and Draws 10250]
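The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a sketch only: the function name is hypothetical, "cores" is assumed to mean physical cores, and the clamp to a minimum of one context is an added assumption that the slide does not state.

```cpp
#include <algorithm>

// Hypothetical helper applying NumContexts = min(cores - 1, Draws / 250).
// Assumption (not from the slide): never return fewer than one context.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = physicalCores > 1 ? physicalCores - 1 : 1;
    const unsigned byDraws = std::max(1u, draws / 250u);
    return std::min(byCores, byDraws);
}
```

For an 8-core processor this gives 4 contexts at 1025 draws, where the draw count is the limit, and 7 contexts at 10250 draws, where the core count is the limit.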
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                         Recommended Value
r.rhicmdusedeferredcontexts     1
r.rhicmduseparallelalgorithms   1
r.rhithread.enable              1
r.rhicmdbypass                  0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            } // lock cmpxchg
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test first with a plain read, then test-and-set with xchg.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
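For code that does not use the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant, not the code from the slides: the `SpinLock` type and `cpuRelax` helper are hypothetical, the non-x86 fallback to `std::this_thread::yield` is an assumption, and `unlock` uses a release store instead of the exchange used above.

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPINLOCK_SKETCH_HAS_PAUSE 1
#endif

// Hint to the core that this is a spin-wait loop.
inline void cpuRelax() {
#if defined(SPINLOCK_SKETCH_HAS_PAUSE)
    _mm_pause();
#else
    std::this_thread::yield();  // assumption: generic fallback off x86
#endif
}

// Aligned to a 64-byte cache line so the lock does not share a line with
// unrelated data.
struct alignas(64) SpinLock {
    std::atomic<bool> taken{false};

    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is held.
            while (taken.load(std::memory_order_relaxed)) {
                cpuRelax();
            }
            // Test-and-set: attempt the exchange only when the lock looked free.
            if (!taken.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    void unlock() {
        taken.store(false, std::memory_order_release);
    }
};

static_assert(alignof(SpinLock) == 64, "lock occupies its own cache line");
```

Because `SpinLock` provides `lock()` and `unlock()`, it can be used with `std::lock_guard<SpinLock>` around a critical section. Profiling remains the way to confirm that a given lock actually benefits from the change.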
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Use Best Practices With Spinlocks, elapsed time in seconds (less is better):

binary  Time (s)  Improvement
Before  304
After   27        +1018%
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <cstdlib>    // strtof, EXIT_SUCCESS
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
[Screenshot slides: SOURCE BEFORE, SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
cmp dword ptr [?gLock@@3IA],1
je 00000001400010D9
mov eax,1
xchg eax,dword ptr [?gLock@@3IA]
cmp eax,1
jne 00000001400010DD
00000001400010D9:
pause
jmp 00000001400010C0
00000001400010DD:
BEFORE (BAD)
00000001400010C1:
xor eax,eax
lock cmpxchg dword ptr [?gLock@@3IA],ecx
cmp eax,ecx
je 00000001400010C1
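A minimal C++ sketch of the pattern the AFTER listing shows: spin on a plain read while the lock is taken, execute PAUSE in the wait loop, and attempt the exchange only once the lock looks free. The names `SpinLock`, `cpu_relax`, and `run_counter_demo` are hypothetical and chosen for illustration; this is not the `MyLock` code from the slides, and `std::atomic` is assumed here in place of the Interlocked intrinsics.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
static inline void cpu_relax() { _mm_pause(); }
#else
// Assumption: on non-x86 targets fall back to yielding the thread.
static inline void cpu_relax() { std::this_thread::yield(); }
#endif

// Hypothetical lock type, aligned to its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: wait on a plain load while the lock is taken.
            while (state.load(std::memory_order_relaxed) == 1) cpu_relax();
            // Test-and-set: one exchange once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0) return;
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};

// Each thread increments a shared counter under the lock.
// Returns the final counter value.
long long run_counter_demo(int num_threads, int increments_per_thread) {
    assert(num_threads > 0 && increments_per_thread >= 0);
    SpinLock lock;
    long long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Calling `run_counter_demo(4, 20000)` should return 80000 if the lock provides mutual exclusion. Whether a compiler emits the same instruction sequence as the listing depends on the compiler and flags.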
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
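To make the three conditions in the preface concrete, here is a small hedged sketch. `in_reported_range`, `copy_runtime_length`, and `copy_fixed_48` are hypothetical helper names, not part of any library. The sketch only illustrates the difference between a copy whose length is a runtime value and one whose length is a compile-time constant; what a given compiler actually emits for either case is an assumption and should be checked in the disassembly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true when a length falls in the range the slide
// reports as affected (length > 32 && length <= 128 bytes).
constexpr bool in_reported_range(std::size_t length) {
    return length > 32 && length <= 128;
}

// Length is only known at run time, so this typically becomes a call to the
// runtime library's memcpy (the case the slide describes).
void copy_runtime_length(char* dst, const char* src, std::size_t length) {
    std::memcpy(dst, src, length);
}

// Length is a compile-time constant; optimizing compilers commonly expand
// such a copy inline instead of calling memcpy (assumption, not guaranteed).
void copy_fixed_48(char* dst, const char* src) {
    std::memcpy(dst, src, 48);
}
```

A benchmark such as the deck's `work(size, steps)` sample exercises the first form, since `size` comes from the command line.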
WORKAROUND
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\” to the application folder
memcpy.asm line 456:
XmmCopySmall:
    bt __favor, __FAVOR_SMSTRG ; check if string copy should be used
    jc memcpy_repmovs
    movups xmm0, [rcx + rdx] ; load deferred bytes
    add rcx, 16
    sub r8, 16
memset.asm line 93:
    ; Check if strings should be used
    cmp r8, 128 ; is this a small set, size <= 128?
    ja XmmSet ; if large set, use XMM set
    bt __favor, __FAVOR_SMSTRG ; check if string set should be used
    jnc XmmSetSmall ; otherwise, use a 16-byte block set
    jmp memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: call memcpy benchmark; elapsed time (s) versus size, series before and after, less is better]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, srand, rand
#include <cstring>  // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
  high_resolution_clock::time_point t0 = high_resolution_clock::now();
  // size is a runtime value, so the copy length is unknown at compile time
  for (size_t i = 0; i < steps; i++) {
    memcpy(b, a, size);
  }
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
  printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
  int j = (argc > 1) ? atoi(argv[1]) : 0;
  int seed = (argc > 2) ? atoi(argv[2]) : 7;
  size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
  int size = (argc > 4) ? atoi(argv[4]) : 48;
  srand(seed);
  // i < sizeof(a): the slide's i <= sizeof(a) writes one element past the end
  for (size_t i = 0; i < sizeof(a); ++i)
    a[i] = rand() % 256;
  memset(b, 0, sizeof(b));
  work(size, steps);
  printf("a[%i] = %i\n", j, a[j]);
  printf("b[%i] = %i\n", j, b[j]);
  return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
mov r11,rcx
mov r10,rdx
cmp r8,10h
jbe 00000001400012F0
cmp r8,20h
jbe 00000001400012D0
sub rdx,rcx
jae 00000001400012A4
lea rax,[r8+r10]
cmp rcx,rax
jb 00000001400015D0
cmp r8,80h
jbe 0000000140001510
…
BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll memcpy
memcpy:
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
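One way to keep a single non-temporal stream active per hardware thread is to split a loop that writes two arrays into two passes, so each pass streams to one array only. The sketch below assumes an x86 target with SSE and the same position/velocity layout as the deck's sample (4 position floats followed by 4 velocity floats). `step_split`, `step_one_array`, `fill_arrays`, and the `g_a`/`g_b` arrays are hypothetical names. This loop-splitting variant is an illustration of the guideline and differs from the deck's own workaround, which replaces one of the two streaming stores with a regular store; its performance is not measured here.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence

constexpr std::size_t kLen = 64000;  // same length as the deck's LEN
static_assert(kLen % 8 == 0, "arrays are processed in blocks of 8 floats");

alignas(64) float g_a[kLen];
alignas(64) float g_b[kLen];

// Helper for the demo: set every element of both arrays.
void fill_arrays(float a_value, float b_value) {
    for (std::size_t i = 0; i < kLen; ++i) {
        g_a[i] = a_value;
        g_b[i] = b_value;
    }
}

// One pass over one array: position += velocity * dt, written with a
// non-temporal store. Only this array is streamed to during the pass.
static void step_one_array(float* data, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < kLen; i += 8) {
        __m128 p = _mm_load_ps(&data[i]);      // x, y, z, w
        __m128 v = _mm_load_ps(&data[i + 4]);  // vx, vy, vz, vw
        _mm_stream_ps(&data[i], _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
}

// Two passes, one streaming destination at a time.
void step_split(float dt) {
    step_one_array(g_a, dt);
    step_one_array(g_b, dt);
    _mm_sfence();  // make the streamed stores globally visible
}
```

After a change like this, the WcbFull PTI metric from the profiling bullets above is the value to compare before and after.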
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid Too Many Non-Temporal Streams; elapsed time (s) by binary, less is better. Before: 104, After: 47 (+123%)]
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, atof, srand, rand
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
  for (size_t i = 0; i < LEN; i += 8) {
    // x,y,z,w,vx,vy,vz,vw
    __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
    __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
    p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
    _mm_stream_ps(&a[(i + 0) % LEN], p1);
    __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
    __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
    p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
    _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround
    _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
  }
}

int main(int argc, char* argv[]) {
  using namespace std::chrono;
  int j = (argc > 1) ? atoi(argv[1]) : 0;
  int seed = (argc > 2) ? atoi(argv[2]) : 3;
  size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
  float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
  srand(seed);
  for (int i = 0; i < LEN; ++i) {
    a[i] = (float)rand() / RAND_MAX;
    b[i] = (float)rand() / RAND_MAX;
  }
  high_resolution_clock::time_point t0 = high_resolution_clock::now();
  for (size_t i = 0; i < steps; i++) {
    step(dt);
  }
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
  printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
  printf("a[%i] = %f\n", j, a[j]);
  printf("b[%i] = %f\n", j, b[j]);
  return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
[Screenshot slides: SOURCE BEFORE, SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters – especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
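The "align and pad thread parameters" advice can be sketched in portable C++ as follows. `PaddedSum`, `kCacheLine`, `kMaxThreads`, and `sum_in_parallel` are hypothetical names for illustration; the 64-byte line size is an assumption that matches the `alignas(64)` used elsewhere in the deck. The slots live in a local array rather than on the heap so that the alignment holds without relying on C++17 aligned `new`.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes
constexpr unsigned kMaxThreads = 64;    // hypothetical cap for this sketch

// Each per-thread accumulator occupies its own cache line, so writes from
// different threads never land in the same line.
struct alignas(kCacheLine) PaddedSum {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedSum) == kCacheLine, "one slot per cache line");

// Every thread adds (i % 2) for i in [0, iterations) into its own slot.
// Returns the total across all threads.
unsigned long sum_in_parallel(unsigned num_threads, unsigned long iterations) {
    assert(num_threads >= 1 && num_threads <= kMaxThreads);
    PaddedSum slots[kMaxThreads];
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        threads.emplace_back([&slots, t, iterations] {
            for (unsigned long i = 0; i < iterations; ++i)
                slots[t].sum += i % 2;
        });
    }
    for (auto& th : threads) th.join();
    unsigned long total = 0;
    for (unsigned t = 0; t < num_threads; ++t) total += slots[t].sum;
    return total;
}
```

Removing `alignas(kCacheLine)` packs the slots next to each other, which is the layout the guideline warns against. An optimizing compiler may keep the running sum in a register, so the effect is easiest to observe with a loop body it cannot hoist, such as the `rand()` call in the deck's sample.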
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid False Sharing; time (s) by binary. Before: 365, After: 47 (+679%)]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, srand, rand
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
  ThreadData* p = (ThreadData*)param;
  srand(seed);
  p->sum = 0;
  for (int i = 0; i < NUM_ITER; i++)
    p->sum += rand() % 2;
  return 0;
}

int main(int argc, char* argv[]) {
  seed = (argc > 1) ? atoi(argv[1]) : 3;
  int numThreads = std::thread::hardware_concurrency();
  HANDLE* threads = new HANDLE[numThreads];
  ThreadData* a = new ThreadData[numThreads];
  high_resolution_clock::time_point t0 = high_resolution_clock::now();
  // each thread receives a pointer to its own ThreadData element
  for (size_t i = 0; i < numThreads; ++i) {
    threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
  }
  WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
  high_resolution_clock::time_point t1 = high_resolution_clock::now();
  duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
  printf("time (ms): %lf\n", 1000.0 * time_span.count());
  for (size_t i = 0; i < numThreads; ++i)
    printf("sum[%llu] = %lu\n", i, a[i].sum);
  delete[] a;
  delete[] threads;
  return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile columns: same as in PROFILING BEFORE]
[Screenshot slides: SOURCE BEFORE, SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi*8],rax
add rbp,40h
inc rdi
cmp rdi,r14
jb 0000000140001180
BEFORE (BAD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi*8],rax
add rbp,4
inc rdi
cmp rdi,r14
jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | size | before (rounded) | after (rounded) | diff (%) | |||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
RYZEN™ THREADRIPPER™ 16 CORE PROCESSOR
• 16 cores, 32 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA or UMA
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors while in NUMA mode
• 64 PCIe® Gen3 lanes
• 50GB/s die-to-die bandwidth (bi-directional)
• RP2-21: DRAM latency for Die0 or Die1 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die1 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 50GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2950X and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (14-14-14-28-1T), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-21
[Diagram: Die 0 and Die 1, each with two CCX blocks plus DDR and IO blocks; DDR4 Channels A, B, C and D; four groups of 16 PCIe® Gen 3 Lanes; ∞ links between blocks]
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR Channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA only
  • Windows 10 2019H1 Insider Preview has improved NUMA support
• The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Diagram: Die 0, Die 1, Die 2 and Die 3, each with two CCX blocks; DDR and IO blocks; DDR4 Channels A, B, C and D; four groups of 16 PCIe® Gen 3 Lanes; ∞ links between dies]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
• Testing by AMD Performance labs. PC manufacturers may vary configurations yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the “Zen” architecture vs. “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 –O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 –O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”)/8GB DDR3-2133 (“Excavator”)/8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs. 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs. 19.2 | +64%). Cinebench R15 1T scores: “Zen” vs. “Piledriver” (139 vs. 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs. 97.5, both at 4.0G | +64%). RZN-11
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP
INSTRUCTION SET EVOLUTION
Columns, left to right (1 = supported, 0 = not supported):
 1 ADX        2 CLFLUSHOPT  3 RDSEED     4 SHA        5 SMAP
 6 XGETBV     7 XSAVEC      8 XSAVES     9 AVX2      10 BMI2
11 MOVBE     12 RDRND      13 SMEP      14 FSGSBASE  15 XSAVEOPT
16 BMI       17 FMA        18 F16C      19 AES       20 AVX
21 OSXSAVE   22 PCLMULQDQ  23 SSE4.1    24 SSE4.2    25 XSAVE
26 SSSE3     27 CLZERO     28 FMA4      29 TBM       30 XOP

YEAR  FAMILY  MODELS   PRODUCT FAMILY                     EXAMPLE PRODUCT     COLUMNS 1-30
2017  17h     00h-0Fh  “Summit Ridge”, “Pinnacle Ridge”   Ryzen™ 7 2700X      11111 11111 11111 11111 11111 11000
2015  15h     60h-6Fh  “Carrizo”, “Bristol Ridge”         A12-9800            00000 00011 11111 11111 11111 10111
2014  15h     30h-3Fh  “Kaveri”, “Godavari”               A10-7890K           00000 00000 00011 11111 11111 10111
2012  15h     00h-0Fh  “Vishera”                          FX-8370             00000 00000 00000 11111 11111 10111
2011  15h     00h-0Fh  “Zambezi”                          FX-8150             00000 00000 00000 00011 11111 10101
2013  16h     00h-0Fh  “Kabini”                           A6-1450             00000 00000 10001 10111 11111 10000
2011  14h     00h-0Fh  “Ontario”                          E-450               00000 00000 00000 00000 00000 10000
2011  12h              “Llano”                            A8-3870             00000 00000 00000 00000 00000 00000
2009  10h              “Greyhound”                        Phenom II X4 955    00000 00000 00000 00000 00000 00000
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for instructions RCPPS, RCPSS, RSQRTPS, RSQRTSS, which define relative error as
  • |Relative Error| <= 1.5 * 2^-12
FLOAT INSTRUCTIONS
[Figure: floating-point unit block diagram. Labels: Decode/Rename, 192 Entry Retire Queue, 36 Entry Scheduler, 160 entry Physical Register File, Forwarding Muxes, MUL0, ADD0, MUL1, ADD1, Load Convert Unit, Int to FP, Fp to Int, 128 bit loads, 4 fp micro-op dispatch, 8 micro-op retire.]
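The relative-error bound above suggests that code consuming reciprocal or reciprocal-square-root estimates should compare against a tolerance rather than expect bit-identical results across vendors. A minimal sketch of such a check follows; the helper name `withinEstimateTolerance` and the constant name are illustrative assumptions, not part of the presentation.

```cpp
#include <cmath>

// Bound quoted on the slide for RCPPS/RCPSS/RSQRTPS/RSQRTSS:
// |relative error| <= 1.5 * 2^-12 (roughly 3.66e-4).
constexpr double kMaxEstimateRelError = 1.5 / 4096.0;

// Hypothetical helper: true when `approx` is an acceptable estimate of `exact`.
// `exact` is assumed to be finite and non-zero.
bool withinEstimateTolerance(double approx, double exact) {
    return std::fabs(approx - exact) <= kMaxEstimateRelError * std::fabs(exact);
}
```

A regression test that compares two machines' outputs could call `withinEstimateTolerance(resultFromEstimate, referenceValue)` instead of using `==`.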
• AMD vendor specific instruction
  • An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution Path: Store Queue → Store Commit → Write Combining Buffer → L2 → Data Fabric → Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
CLZERO INSTRUCTION
[Figure: load/store unit block diagram. Labels: AGU0, AGU1, Load Queue, Store Queue, L0 Pick, L1 Pick, L1/L2 TLB, Micro-tags, Page Walker, TLB0, DAT0, TLB1, DAT1, 32K Data Cache, Prefetch, MAB, WCB, Store Pipe Pick, STP, Store Commit, To Ex, To FP, To L2, 32 bytes to/from L2.]
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
CACHE LATENCY
Level  Count  Capacity  Sets  Ways  Line Size  Latency
uop    8      2 K uops  32    8     8 uops     NA
L1I    8      64 KB     256   4     64 B       4 clocks
L1D    8      32 KB     64    8     64 B       4 clocks
L2U    8      512 KB    1024  8     64 B       12 clocks
L3U    2      8 MB      8192  16    64 B       35 clocks

Cache Latency in core clock cycles (less is better)
                      L1D  L2U  L3U
AMD Ryzen™ 7 1800X    4    17   40
AMD Ryzen™ 7 2700X    4    12   35
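The capacity column in the table above follows from sets × ways × line size. The compile-time checks below are a small worked example of that arithmetic; the function and constant names are illustrative assumptions.

```cpp
#include <cstddef>

// capacity = sets * ways * line size (bytes for the caches, uops for the op cache)
constexpr std::size_t cacheCapacity(std::size_t sets, std::size_t ways,
                                    std::size_t lineSize) {
    return sets * ways * lineSize;
}

constexpr std::size_t KiB = 1024;
constexpr std::size_t MiB = 1024 * KiB;

// Values taken from the table on the slide.
static_assert(cacheCapacity(32, 8, 8) == 2048, "uop cache: 2 K uops");
static_assert(cacheCapacity(256, 4, 64) == 64 * KiB, "L1I: 64 KB");
static_assert(cacheCapacity(64, 8, 64) == 32 * KiB, "L1D: 32 KB");
static_assert(cacheCapacity(1024, 8, 64) == 512 * KiB, "L2: 512 KB");
static_assert(cacheCapacity(8192, 16, 64) == 8 * MiB, "L3: 8 MB per CCX");
```

The same relation is useful in reverse: with 64 sets and 64-byte lines, addresses that are 4 KB apart map to the same L1D set, so more than 8 such hot addresses would compete for the 8 ways.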
• ~64ns near memory
  • RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 39836), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM LOCAL DRAM
[Figure: data path diagram. Two CCXs, each attached to a CCM, connect through the SDF Transport Layer to two CS/UMC pairs, each driving a DDR4 channel. A CAKE block links the SDF to IFIS or IFOP off-chip.]
• Refill from other CCX cost may be similar to memory latency
REFILL FROM OTHER LOCAL CCX
[Figure: same data path diagram as the previous slide (CCX, CCM, SDF Transport Layer, CS, UMC, DDR4, CAKE, IFIS or IFOP off-chip), here illustrating a refill from the other CCX on the same die.]
REFILL FROM REMOTE DIE DRAM
[Figure: two copies of the data path diagram, one per die (CCX, CCM, SDF Transport Layer, CS, UMC, DDR4, CAKE), connected to each other off-chip through IFIS or IFOP.]
• ~105ns far memory (see footnote RP2-22 under REFILL FROM LOCAL DRAM)
AMD UPROF PROFILER
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p High, novsync
Threads  Time (s)
0        23
1        29
2        6
3        3
4        4
5        4
6        4
7        3
8        2
9        2
10       1
11       1
12       1
13       1
14       1
15       5
16       11
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
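Most metrics in the list above are normalized: IPC is retired instructions per CPU clock, and PTI means "per thousand instructions". As a rough sketch of how such derived values can be computed from raw counter totals (the function names are illustrative assumptions, and the exact formulas the profiler uses internally are not given in the presentation):

```cpp
#include <cstdint>

// Instructions per clock: retired instructions divided by unhalted CPU clocks.
double instructionsPerClock(std::uint64_t retiredInstructions,
                            std::uint64_t cpuClocks) {
    return cpuClocks == 0 ? 0.0
                          : static_cast<double>(retiredInstructions) /
                                static_cast<double>(cpuClocks);
}

// Events per thousand retired instructions (PTI).
double perThousandInstructions(std::uint64_t eventCount,
                               std::uint64_t retiredInstructions) {
    return retiredInstructions == 0
               ? 0.0
               : 1000.0 * static_cast<double>(eventCount) /
                     static_cast<double>(retiredInstructions);
}
```

Normalizing by instruction count makes functions of very different sizes comparable: 5,000 misaligned loads in a function that retired 1,000,000 instructions is 5 PTI, regardless of how long the function ran.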
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
[Figure: domain diagram. Additional labels: rdpmc, SMN in/out.]
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year, Visual Studio changes, and AMD product family (implicit):
2019 (“Pinnacle Ridge”): Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations.
2017 (“Summit Ridge”): Improved code generation of loops. Support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option.
2015 (“Kaveri”, “Godavari”): Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2.
2013 (“Vishera”): Improved inline. Improved auto-vectorization. Improved ISO C99 language and library.
2012 (“Bulldozer”): Added autovectorization. Optimized container memory sizes.
2010: Added nullptr keyword. Replaced VCBuild with MSBuild.
2008 (“Greyhound”): Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & managed incremental builds.
2005 (“K8”): Added x64 native compiler.
• Binary compiled after applying platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORM X64
LAME 3.100 WAV to MP3 Encode, elapsed time in seconds (less is better)
Configuration  Platform  Seconds  vs. Release x86
Release        x86       52
ReleaseSSE2    x86       44       +18%
Release        x64       43       +21%
ReleaseSSE2    x64       39       +33%
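The percentages on the chart appear consistent with the ratio of baseline time to new time, minus one. A short worked example of that arithmetic follows; the function names are illustrative assumptions, and the rounded whole-second values from the slide are used as inputs.

```cpp
#include <cmath>

// Speedup relative to a baseline, in percent, for "lower is better" timings:
// (baseline / measured - 1) * 100.
double speedupPercent(double baselineSeconds, double measuredSeconds) {
    return (baselineSeconds / measuredSeconds - 1.0) * 100.0;
}

// Same value rounded to the nearest whole percent, as shown on the chart.
long speedupPercentRounded(double baselineSeconds, double measuredSeconds) {
    return std::lround(speedupPercent(baselineSeconds, measuredSeconds));
}
```

With the Release x86 time of 52 s as the baseline, 44 s, 43 s, and 39 s round to 18, 21, and 33 percent respectively, matching the labels on the slide.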
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

0:000> g
(79c838) Illegal instruction - code c000001d (first chance)
(79c838) Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
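The crash in the debugger output above comes from executing `pabsd`, an SSSE3 instruction, on a processor that does not enumerate SSSE3. A minimal sketch of the guard follows, assuming a hypothetical `absDispatch` entry point and a scalar fallback; the CPUID bit used (function 1, ECX bit 9) is the architectural SSSE3 flag. The optimized kernel itself is omitted, so both branches run the scalar code here.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Returns ECX of CPUID function 1, or 0 when CPUID is not available.
static std::uint32_t cpuidFn1Ecx() {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    return static_cast<std::uint32_t>(regs[2]);
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return ecx;
#else
    return 0;
#endif
}

// SSSE3 is enumerated by CPUID function 1, ECX bit 9.
bool cpuHasSsse3() { return ((cpuidFn1Ecx() >> 9) & 1u) != 0; }

// Scalar fallback that is safe on every processor. INT32_MIN maps to itself
// (bit pattern 0x80000000), which is also what pabsd produces.
void absScalar(const std::int32_t* in, std::int32_t* out, int n) {
    for (int i = 0; i < n; ++i) {
        const std::uint32_t u = static_cast<std::uint32_t>(in[i]);
        out[i] = static_cast<std::int32_t>(in[i] < 0 ? 0u - u : u);
    }
}

// Hypothetical dispatcher: check the feature once, then pick a code path.
void absDispatch(const std::int32_t* in, std::int32_t* out, int n) {
    static const bool hasSsse3 = cpuHasSsse3();
    if (hasSsse3) {
        // An SSSE3 kernel built around pabsd would be called here.
        absScalar(in, out, n);
    } else {
        absScalar(in, out, n);
    }
}
```

The important property is that the feature test happens before any instruction from the optional extension can execute, including in compiler-generated code for the translation unit that contains the optimized kernel.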
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        } else {
            count = cores;
        }
    }
    return count;
}
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
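The slide's function relies on helpers (`getCpuidFamily`, `getCpuidVendor`, `getProcessorCount`) whose bodies are not shown. As a hedged sketch of two of the pieces, the code below decodes the family from the EAX value returned by CPUID function 1 and applies the same policy as a pure function. The names `decodeCpuidFamily` and `chooseThreadCount` are assumptions for illustration; the family decoding follows the standard CPUID rule (extended family is added only when the base family is 0xF).

```cpp
#include <cstdint>
#include <cstring>

// Family from CPUID function 1 EAX: base family in bits 11:8, extended family
// in bits 27:20. The extended field is added only when the base family is 0xF.
std::uint32_t decodeCpuidFamily(std::uint32_t eax) {
    const std::uint32_t baseFamily = (eax >> 8) & 0xFu;
    const std::uint32_t extFamily = (eax >> 20) & 0xFFu;
    return baseFamily == 0xFu ? baseFamily + extFamily : baseFamily;
}

// Same policy as the slide: on AMD processors other than family 15h, default
// to one thread per physical core; otherwise use all logical processors.
unsigned chooseThreadCount(const char* vendor, std::uint32_t family,
                           unsigned physicalCores, unsigned logicalProcessors) {
    if (std::strcmp(vendor, "AuthenticAMD") == 0 && family != 0x15u) {
        return physicalCores;
    }
    return logicalProcessors;
}
```

For example, an EAX value of 0x00800F82 decodes to family 17h, so an 8-core, 16-thread part would default to 8 threads under this policy. Keeping the policy separate from the hardware queries also makes it easy to override from profiling results, as the slide recommends.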
Incorrect enum.exe output for Ryzen™ 7 2700X:

Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • "Number of cores per processor" & "Number of threads per processor core" return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64;                    // logical processors
DWORD_PTR p = 0xffffffffffffffff;  // process mask
int b = 32;                        // bit index
#if 0 // BAD
int x = p;                         // signed narrowing
int count = 0;
// while (x != 0) {                // never zero
while (x > 0) {                    // false
    count += (x & 1);
    x >>= 1;                       // fill bits with sign
}
printf("popcnt = %i\n", count);                      // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
#endif
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
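The GOOD branch above depends on Windows typedefs. A portable sketch of the same idea, using fixed-width unsigned types so that neither the mask nor the shifted constant can narrow or sign-extend, might look like this (the function names are illustrative assumptions):

```cpp
#include <cstdint>

// Counts set bits in a 64-bit affinity mask without narrowing it.
int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;  // unsigned right shift fills with zeros, so the loop ends
    }
    return count;
}

// Builds the single-bit mask for logical processor `bit` (0..63).
// The constant is 64 bits wide before shifting, so bits 31..63 are valid.
std::uint64_t processorBit(unsigned bit) {
    return std::uint64_t{1} << bit;
}
```

With a full process mask, `countProcessors(0xffffffffffffffff)` yields 64, and `processorBit(32)` yields 0x0000000100000000, the value the slide's comment expects.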
// char buffer[0x1000];  // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll,
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
        &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                &len)) {
            // ...
        }
        free(buffer);
    }
}
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb 4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
[Figure: line chart "D3D12Multithreading, TitleThrottle=10000" (higher is better). X axis: NumContexts (0 to 16). Y axis: cpu time % difference from NumContexts 1 (-50 to 400). Series: Draws 1025 and Draws 10250.]
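The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a sketch under two assumptions that the slide does not state: the division is an integer division, and the result is clamped to at least one context so that small scenes and single-core systems still get a valid value. The function name is illustrative.

```cpp
#include <algorithm>

// Recommended number of command-list recording contexts:
// min(cores - 1, draws / 250), clamped to a minimum of 1 (assumption).
unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = physicalCores > 1 ? physicalCores - 1 : 1;
    const unsigned byDraws = std::max(1u, draws / 250u);
    return std::min(byCores, byDraws);
}
```

On an 8-core processor, the two draw counts from the chart give 4 contexts for 1025 draws (limited by the draw count) and 7 contexts for 10250 draws (limited by the core count, leaving one core for the main thread).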
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING
Command                         Recommended Value
r.rhicmdusedeferredcontexts     1
r.rhicmduseparallelalgorithms   1
r.rhithread.enable              1
r.rhicmdbypass                  0
USE BEST PRACTICES WITH SPINLOCKS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Spin Lock Best Practices
bull avoid lock prefix instructionsbull use the pause instructionbull test and test-and-setbull alignas(64) lock variable
bull or _declspec(align(64))bull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull ALUTokenStall PTI
bull gt= 3K Per Thousand Instructions is bad for top functions
46
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD: every spin iteration issues a locked read-modify-write
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                   pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
        }
    }
#else // GOOD: test first, then test-and-set, with pause while waiting
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
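The same test and test-and-set pattern can be sketched with portable `std::atomic` instead of the MSVC intrinsics. This is a minimal sketch under stated assumptions: the names `SpinLock`, `SPIN_PAUSE`, and `CountWithLock` are hypothetical, and unlike the slide's `Unlock` (which uses `_InterlockedExchange`) this version releases the lock with a plain release store.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()                 // x86 pause instruction
#else
#define SPIN_PAUSE() std::this_thread::yield()   // fallback on other targets
#endif

// Hypothetical lock type: one lock per 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};              // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on an ordinary load while the lock is taken.
            while (state.load(std::memory_order_relaxed) != 0) {
                SPIN_PAUSE();
            }
            // Test-and-set: attempt the atomic exchange only when it looked free.
            if (state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

// Hypothetical demo: several threads increment one counter under the lock.
long CountWithLock(int numThreads, int itersPerThread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t) {
        threads.emplace_back([&] {
            for (int i = 0; i < itersPerThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : threads) th.join();
    return counter;
}
```

Spinning on the load keeps the cache line in a shared state while waiting, so the atomic exchange is attempted only when the lock appears free.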
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) per binary: Before 304, After 27 (+1018%).]
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <algorithm>
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

// Each thread takes the lock, then multiplies LEN 4x4 matrices repeatedly.
DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
00000001400010D9:
    pause
    jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without the lock prefix (the xchg rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs only under very specific circumstances: length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder

memcpy.asm line 456:
XmmCopySmall:
    bt     __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc     memcpy_repmovs
    movups xmm0, [rcx + rdx]         ; load deferred bytes
    add    rcx, 16
    sub    r8, 16

memset.asm line 93:
    ; Check if strings should be used
    cmp    r8, 128                   ; is this a small set, size <= 128?
    ja     XmmSet                    ; if large set, use XMM set
    bt     __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc    XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp    memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). Y axis: Elapsed Time (s), 0 to 35; X axis: size, 0 to 160; series: before, after.]
CODE SAMPLE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

// Times `steps` calls to memcpy with a size that is unknown at compile time.
void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        memcpy(b, a, size);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // Loop bound is < sizeof(a); <= would write one byte past the array.
    for (int i = 0; i < (int)sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll memcpy
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
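One way to keep a single write-combining stream active per hardware thread is to process each array in its own pass rather than interleaving non-temporal stores to two arrays in one loop. The sketch below illustrates that idea only; it is not the workaround used in this deck's code sample, which instead replaces one of the two streaming stores with a regular store. The names `StepOneStream` and `StepBoth` are hypothetical, and the layout (x, y, z, w, vx, vy, vz, vw per element, 16-byte aligned, length a multiple of 8) is assumed.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence

// Hypothetical helper: integrates positions in one array using a single
// non-temporal store stream. Assumes arr is 16-byte aligned and n is a
// multiple of 8 (x y z w vx vy vz vw per element).
void StepOneStream(float* arr, std::size_t n, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < n; i += 8) {
        const __m128 p = _mm_load_ps(&arr[i]);      // position
        const __m128 v = _mm_load_ps(&arr[i + 4]);  // velocity
        _mm_stream_ps(&arr[i], _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    _mm_sfence();  // make the streamed stores globally visible before returning
}

// Two arrays are updated in two separate passes, so only one
// non-temporal stream is in flight at any time.
void StepBoth(float* a, float* b, std::size_t n, float dt) {
    StepOneStream(a, n, dt);
    StepOneStream(b, n, dt);
}
```

Whether this beats a regular store for the second array depends on the data size and access pattern, so it is worth confirming with the WcbFull PTI metric before adopting it.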
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) per binary: Before 104, After 47 (+123%).]
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store for the second array
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        step(dt);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
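The "align and pad thread parameters" advice can be sketched with standard C++ threads instead of the Win32 API used in this deck's code sample. This is only an illustration: `PaddedCounter` and `SumPerThread` are hypothetical names, 64 bytes is assumed as the cache line size, and C++17 or later is assumed so that the over-aligned elements of the `std::vector` are allocated with the requested alignment.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread slot: alignas(64) pads the struct to a full
// cache line, so neighbouring slots never share a line.
struct alignas(64) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per 64-byte line");

// Each thread accumulates into its own padded slot; results are
// collected only after all threads have joined.
std::vector<unsigned long> SumPerThread(std::size_t numThreads,
                                        unsigned long iters) {
    std::vector<PaddedCounter> counters(numThreads);
    std::vector<std::thread> threads;
    for (std::size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&counters, t, iters] {
            for (unsigned long i = 0; i < iters; ++i) {
                counters[t].sum += i % 2;
            }
        });
    }
    for (auto& th : threads) th.join();

    std::vector<unsigned long> out;
    for (const auto& c : counters) out.push_back(c.sum);
    return out;
}
```

An optimizing compiler may keep this simple accumulator in a register, which hides the effect; a loop body it cannot hoist, such as the `rand()` call in the deck's sample, is needed to measure the difference against an unpadded struct.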
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. Time (s) per binary: Before 365, After 47 (+679%).]
CODE SAMPLE
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: adjacent ThreadData share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: one ThreadData per cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif
DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                  (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms) %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
PROFILING AFTER
Few refills from CCX
[Profile columns on both slides: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180

BEFORE (BAD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New “Zen 2” core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContexts | Draws 1025 (cpu time) | Draws 10250 (cpu time) | NumContexts | Draws 1025 (% diff from NumContexts 1) | Draws 10250 (% diff from NumContexts 1)
1 | 0.2016 | 1.5024 | 1 | 0 | 0
2 | 0.148 | 0.7923 | 2 | 36 | 90
3 | 0.1336 | 0.5653 | 3 | 51 | 166
4 | 0.1278 | 0.4625 | 4 | 58 | 225
5 | 0.1387 | 0.4025 | 5 | 45 | 273
6 | 0.1499 | 0.3674 | 6 | 34 | 309
7 | 0.17 | 0.3429 | 7 | 19 | 338
8 | 0.1867 | 0.3432 | 8 | 8 | 338
9 | 0.1859 | 0.3947 | 9 | 8 | 281
10 | 0.1947 | 0.3854 | 10 | 4 | 290
11 | 0.2083 | 0.3754 | 11 | -3 | 300
12 | 0.2173 | 0.3788 | 12 | -7 | 297
13 | 0.2241 | 0.3787 | 13 | -10 | 297
14 | 0.2533 | 0.3783 | 14 | -20 | 297
15 | 0.2526 | 0.3800 | 15 | -20 | 295
16 | 0.2652 | 0.3721 | 16 | -24 | 304
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics 10, liquid good, Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21B. Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
RYZEN™ THREADRIPPER™ 32 CORE PROCESSOR
• 32 cores, 64 threads
• 4 DDR channels
  • ~64ns near memory
  • ~105ns far memory
  • NUMA only
    • Windows 10 2019H1 Insider Preview has improved NUMA support
    • The AMD Ryzen™ Master Utility Game Mode may improve performance by effectively improving memory latency by restricting the system to use only die0 processors
• 64 PCIe® Gen3 lanes
• 25GB/s die-to-die bandwidth (bi-directional)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
[Diagram: Die 0 to Die 3 with two CCXs per die, DDR4 channels A to D, IO and DDR blocks, 16 PCIe® Gen 3 lane links, and Infinity Fabric (∞) connections.]
MICROARCHITECTURE
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front-end queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous-generation AMD processors
• RZN-11: Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 (“Zen”) / 8GB DDR3-2133 (“Excavator”) / 8GB DDR3-1866 (“Piledriver”), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017. Generational IPC uplift for the “Zen” architecture vs. the “Piledriver” architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the “Zen” architecture vs. the “Excavator” architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. SPECint_base2006 estimates: “Zen” vs. “Piledriver” (31.5 vs. 20.7 | +52%), “Zen” vs. “Excavator” (31.5 vs. 19.2 | +64%). Cinebench R15 1T scores: “Zen” vs. “Piledriver” (139 vs. 79, both at 3.4G | +76%), “Zen” vs. “Excavator” (160 vs. 97.5, both at 4.0G | +64%).
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor-specific instruction CLZERO and deprecated support for FMA4, TBM & XOP. In the table, 1 means the instruction set is supported and 0 means it is not.
Year | Family | Models | Product Family | Example Product | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE4.1 | SSE4.2 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | “Llano” | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS, and RSQRTSS, which define relative error as:
  • |Relative Error| <= 1.5 * 2^-12
[Figure: floating-point unit block diagram: decode/rename with 4 FP micro-op dispatch, 36-entry scheduler, 160-entry physical register file, MUL0/ADD0/MUL1/ADD1 pipes, forwarding muxes, load convert unit with 128-bit loads, Int to FP and FP to Int paths, 192-entry retire queue with 8 micro-op retire]
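Because RCPPS-class results are only specified to a relative error bound, tests that compare such results bit-for-bit across vendors can fail even when both results are valid. The following is a minimal sketch of a tolerance check built on the bound quoted above; the helper names `withinRcpTolerance` and `bothValidReciprocals` are hypothetical and not part of any AMD or compiler API.

```cpp
#include <cassert>
#include <cmath>

// Bound quoted on the slide: |relative error| <= 1.5 * 2^-12.
constexpr double kRcpMaxRelativeError = 1.5 / 4096.0;

// Hypothetical helper: true when `approx` lies within the documented
// relative error of `exact`. `exact` must be non-zero.
inline bool withinRcpTolerance(double approx, double exact) {
    assert(exact != 0.0);
    return std::fabs((approx - exact) / exact) <= kRcpMaxRelativeError;
}

// Two approximate reciprocals of x can both be valid and still differ,
// so each one is checked against the exact value, not against the other.
inline bool bothValidReciprocals(float a, float b, float x) {
    const double exact = 1.0 / static_cast<double>(x);
    return withinRcpTolerance(a, exact) && withinRcpTolerance(b, exact);
}
```

For x = 4, the values 0.25005 and 0.24995 are both accepted even though they are not equal, which is the situation a cross-vendor regression test needs to tolerate.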
CLZERO INSTRUCTION
• AMD vendor-specific instruction
• An atomic MOVNT (streaming store) type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue → Store Commit → Write Combining Buffer → L2 → Data Fabric → Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Figure: load/store unit block diagram: AGU0/AGU1, load queue and store queue, L1/L2 TLB with micro-tags and page walker, L0/L1 pick, TLB0/DAT0 and TLB1/DAT1, 32K data cache with prefetch and 32 bytes to/from L2, store pipe pick (STP), store commit, WCB to L2, MAB]
CACHE LATENCY
• Cache line size is 64 bytes
  • 2 CPU clock cycles to move a single cache line
• L2 is inclusive of L1
  • Lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
Cache latency in core clock cycles (less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
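The capacity column in the cache table follows from sets × ways × line size. A small sketch that checks this for the rows above and shows which addresses compete for the same L1D set; the type and function names are hypothetical, and the plain modulo set mapping is an assumption (real hardware may hash address bits differently).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical description of one cache level, taken from the table rows.
struct CacheGeometry {
    std::uint64_t sets;
    std::uint64_t ways;
    std::uint64_t lineBytes;
};

// Capacity = sets * ways * line size.
constexpr std::uint64_t capacityBytes(CacheGeometry g) {
    return g.sets * g.ways * g.lineBytes;
}

// Set index under a simple modulo mapping (an assumption for illustration).
inline std::uint64_t setIndex(CacheGeometry g, std::uint64_t address) {
    assert(g.sets != 0 && g.lineBytes != 0);
    return (address / g.lineBytes) % g.sets;
}

constexpr CacheGeometry kL1I{256, 4, 64};
constexpr CacheGeometry kL1D{64, 8, 64};
constexpr CacheGeometry kL2{1024, 8, 64};
constexpr CacheGeometry kL3{8192, 16, 64};

static_assert(capacityBytes(kL1I) == 64 * 1024, "L1I is 64 KB");
static_assert(capacityBytes(kL1D) == 32 * 1024, "L1D is 32 KB");
static_assert(capacityBytes(kL2) == 512 * 1024, "L2 is 512 KB");
static_assert(capacityBytes(kL3) == 8 * 1024 * 1024, "L3 is 8 MB");
```

Under this mapping, addresses that are 4096 bytes apart (64 sets × 64 B) land in the same L1D set, so more than 8 such hot lines would exceed the 8 ways and start evicting each other.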
REFILL FROM LOCAL DRAM
• ~64ns near memory (see RP2-22)
• Abbreviations used in the refill diagrams:
  • CCX: Core Complex
  • CCM: Cache-Coherent Master
  • SDF: Scalable Data Fabric
  • CAKE: Coherent AMD socKet Extender
  • IFIS: Infinity Fabric Inter-Socket SerDes
  • IFOP: Infinity Fabric On-Package SerDes
  • CS: Coherent Slave
  • UMC: Unified Memory Controller
[Figure: refill path from local DRAM: CCX, CCM, SDF transport layer, CS, UMC, DDR4; CAKE and IFIS or IFOP lead off-chip]
REFILL FROM OTHER LOCAL CCX
• The cost of a refill from the other CCX may be similar to memory latency
[Figure: refill path from the other CCX on the same die through CCM and the SDF transport layer]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (see RP2-22)
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration.
[Figure: refill path from remote die DRAM: two dies, each with CCX, CCM, SDF transport layer, CS, UMC, and DDR4, connected off-chip through CAKE and IFIS or IFOP]
AMD μPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2 SAS Machine Finale, 1080p, High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
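Several of the metrics above are ratios of raw counts: IPC is retired instructions divided by CPU clocks, and PTI means "per thousand instructions". The sketch below shows that arithmetic, assuming the inputs are already event totals (sampled counts would first need scaling by the sampling interval); the function names are hypothetical and are not part of the AMD μProf tool.

```cpp
#include <cassert>
#include <cstdint>

// Instructions per clock: retired instructions divided by CPU clocks.
inline double ipc(std::uint64_t retiredInstructions, std::uint64_t cpuClocks) {
    assert(cpuClocks != 0);
    return static_cast<double>(retiredInstructions) /
           static_cast<double>(cpuClocks);
}

// Events per thousand retired instructions (PTI).
inline double perThousandInstructions(std::uint64_t events,
                                      std::uint64_t retiredInstructions) {
    assert(retiredInstructions != 0);
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retiredInstructions);
}
```

For example, 5,000 misaligned loads over 1,000,000 retired instructions is 5 PTI. Normalizing by instructions rather than by time makes functions with very different run times comparable.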
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L2
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access labels shown in the diagram: rdpmc; SMN in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: the SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops. Support for automatic vectorization of division of constant integers. Better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly-optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled for Platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
LAME 3.100 WAV to MP3 Encode, elapsed time in seconds (less is better)
Configuration, Platform | seconds | diff
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW
Example WinDbg output when an unsupported instruction executes:
    0:000> g
    (79c.838): Illegal instruction - code c000001d (first chance)
    (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
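A minimal sketch of the check the slide title asks for: read CPUID and test the feature bit before taking an SSSE3 or FMA4 code path. The bit positions used are the architecturally documented ones (SSSE3 is leaf 1 ECX bit 9; FMA4 is leaf 8000_0001h ECX bit 16). The wrapper names (`readCpuid`, `cpuSupportsSsse3`, `cpuSupportsFma4`) and the `EXAMPLE_HAVE_CPUID` macro are hypothetical. For AVX-class features an application would also need to confirm OS support through OSXSAVE and XGETBV, which is not shown here.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define EXAMPLE_HAVE_CPUID 1
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define EXAMPLE_HAVE_CPUID 1
#else
#define EXAMPLE_HAVE_CPUID 0
#endif

struct CpuidRegs { std::uint32_t eax, ebx, ecx, edx; };

// Reads one CPUID leaf; returns false when the leaf (or CPUID) is unavailable.
inline bool readCpuid(std::uint32_t leaf, CpuidRegs& out) {
#if EXAMPLE_HAVE_CPUID && defined(_MSC_VER)
    int r[4];
    __cpuid(r, static_cast<int>(leaf & 0x80000000u));  // highest leaf in range
    if (static_cast<std::uint32_t>(r[0]) < leaf) return false;
    __cpuid(r, static_cast<int>(leaf));
    out = CpuidRegs{static_cast<std::uint32_t>(r[0]), static_cast<std::uint32_t>(r[1]),
                    static_cast<std::uint32_t>(r[2]), static_cast<std::uint32_t>(r[3])};
    return true;
#elif EXAMPLE_HAVE_CPUID
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(leaf, &a, &b, &c, &d)) return false;
    out = CpuidRegs{a, b, c, d};
    return true;
#else
    out = CpuidRegs{0, 0, 0, 0};
    return false;  // not an x86 build
#endif
}

inline bool testBit(std::uint32_t value, unsigned bit) {
    assert(bit < 32);
    return ((value >> bit) & 1u) != 0;
}

// CPUID Fn0000_0001 ECX bit 9: SSSE3.
inline bool hasSsse3(std::uint32_t leaf1Ecx) { return testBit(leaf1Ecx, 9); }
// CPUID Fn8000_0001 ECX bit 16: FMA4.
inline bool hasFma4(std::uint32_t ext1Ecx) { return testBit(ext1Ecx, 16); }

inline bool cpuSupportsSsse3() {
    CpuidRegs r{};
    return readCpuid(1u, r) && hasSsse3(r.ecx);
}

inline bool cpuSupportsFma4() {
    CpuidRegs r{};
    return readCpuid(0x80000001u, r) && hasFma4(r.ecx);
}
```

A dispatcher would call `cpuSupportsSsse3()` once at startup and select a `pabsd`-based routine only when it returns true, falling back to an SSE2 path otherwise.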
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits, and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
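The thread-count policy above can be separated from the platform queries so that it is easy to test. The following is a portable sketch of the same decision; `CpuInfo` and `defaultThreadCount` are hypothetical names, and the vendor string, family, and core counts are assumed to come from CPUID and the OS topology APIs.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical summary of what CPUID and the OS report.
struct CpuInfo {
    const char* vendor;          // 12-character CPUID vendor string
    unsigned family;             // CPUID display family, e.g. 0x17 for "Zen"
    unsigned physicalCores;
    unsigned logicalProcessors;
};

// Same policy as the slide: on AMD processors other than family 15h
// ("Bulldozer", which is not an SMT design), default to the physical
// core count; otherwise use all logical processors.
inline unsigned defaultThreadCount(const CpuInfo& cpu) {
    assert(cpu.vendor != nullptr);
    unsigned count = cpu.logicalProcessors;
    if (std::strcmp(cpu.vendor, "AuthenticAMD") == 0 && cpu.family != 0x15) {
        count = cpu.physicalCores;
    }
    return count;
}
```

With these inputs a Ryzen™ 7 2700X (family 17h, 8 cores, 16 logical processors) yields 8 threads. This is only a starting default; the slide's advice to profile for the ideal thread count still applies.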
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • "Number of cores per processor" & "Number of threads per processor core" return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing of affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                        // logical processors
    DWORD_PTR p = 0xffffffffffffffff;      // process mask
    int b = 32;                            // bit index
    #if 0 // BAD
        int x = p;                         // signed narrowing: x becomes -1
        int count = 0;
        // while (x != 0)                  // never zero: loop would not terminate
        while (x > 0) {                    // false: loop body never runs
            count += (x & 1);
            x >>= 1;                       // fills vacated bits with the sign bit
        }
        printf("popcnt = %i\n", count);    // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined, 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
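The same idea can be written without Windows types. This is a small sketch that keeps the affinity mask in an unsigned 64-bit integer throughout, so right shifts fill with zeros and bit 32 and above stay representable; `countSetBits` and `bitMask` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Counts logical processors in a 64-bit affinity mask. The mask stays
// unsigned, so the loop terminates and the right shift fills with zeros.
inline int countSetBits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Builds the mask for one logical processor. The shift is done on a
// 64-bit unsigned value; shifting a plain int by 32 or more is undefined.
inline std::uint64_t bitMask(unsigned bitIndex) {
    assert(bitIndex < 64);
    return std::uint64_t{1} << bitIndex;
}
```

With a full process mask the count is 64, and `bitMask(32)` is 0x0000000100000000, matching the GOOD branch of the slide code.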
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"

    // char buffer[0x1000];   /* bad assumption */
    char* buffer = NULL;
    DWORD len = 0;
    // First call with a NULL buffer: fails and reports the required length in len.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            // Second call with a buffer of the reported size.
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or an insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation
• See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
bull Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw countbull See httpsgithubcomMicrosoftDirectX-Graphics-
SamplestreemasterSamplesDesktopD3D12Multithreadingbull Recommend NumContexts = min(cores-1 Draws250) bull Performance of binary compiled with Microsoftreg Visual Studio
2017 v1596bull Testing done by Kenneth Mitchell February 12 2019 on the
following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 (driver 1921 Feb4) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
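The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small function. The function name is hypothetical, and the lower bound of 1 is an added assumption (not stated on the slide) so that small scenes or single-core machines still get one context.

```cpp
#include <algorithm>

// Sketch of the slide's heuristic: NumContexts = min(cores - 1, draws / 250).
// Clamping the result to at least 1 is an assumption added for this example.
inline unsigned RecommendedNumContexts(unsigned cores, unsigned drawCount) {
    const unsigned byCores = cores > 1 ? cores - 1 : 1;
    const unsigned byDraws = drawCount / 250;  // integer division
    return std::max(1u, std::min(byCores, byDraws));
}
```

With the two draw counts used in the chart on an 8-core processor, this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.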
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: CPU time difference from NumContexts=1. Series: Draws=1025, Draws=10250.]
• Be sure to re-enable parallel rendering after using debug features, before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0, Assess Performance (Extended), Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

#if 0 // BAD
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                   pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
        } // lock xchg cmp
    }
#else // GOOD
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}

alignas(64) MyLock::LOCK gLock;
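The same best practices (test and test-and-set, pause, 64-byte alignment) can also be sketched with standard C++ atomics instead of MSVC intrinsics. This is an illustrative variant, not the presenter's code: the class name and the SPIN_PAUSE macro are hypothetical, and on non-x86 targets the sketch falls back to a thread yield.

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#include <thread>
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// Hypothetical class: one lock per 64-byte cache line.
class alignas(64) TtasSpinLock {
public:
    void lock() {
        for (;;) {
            // Test: wait on a plain load so waiters do not keep writing the line.
            while (taken_.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: one exchange; it returns the previous value.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    bool try_lock() {
        return !taken_.load(std::memory_order_relaxed) &&
               !taken_.exchange(true, std::memory_order_acquire);
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> taken_{false};
};
```

Because the class provides lock, try_lock, and unlock, it can be used with std::lock_guard<TtasSpinLock>.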
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) by binary: Before 304, After 27 (+1018%).]
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <windows.h>
#include <algorithm>  // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
cmp dword ptr [?gLock@@3IA],1
je 00000001400010D9
mov eax,1
xchg eax,dword ptr [?gLock@@3IA]
cmp eax,1
jne 00000001400010DD
00000001400010D9:
pause
jmp 00000001400010C0
00000001400010DD:
BEFORE (BAD)
00000001400010C1:
xor eax,eax
lock cmpxchg dword ptr [?gLock@@3IA],ecx
cmp eax,ecx
je 00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
PREFACE
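To judge whether an application is likely to be exposed, the length condition above can be written as a predicate and applied to a sample of run-time copy sizes. Both helper names are hypothetical; the sketch only encodes the stated range (length > 32 && length <= 128 bytes) and does not detect the processor family or the runtime version.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: true when a run-time copy length lies in the range
// the slide reports (length > 32 && length <= 128 bytes).
constexpr bool InReportedRegressionRange(std::size_t length) {
    return length > 32 && length <= 128;
}

// Hypothetical helper: fraction of sampled copy lengths that fall in that range.
inline double AffectedFraction(const std::vector<std::size_t>& lengths) {
    if (lengths.empty()) {
        return 0.0;
    }
    std::size_t hits = 0;
    for (std::size_t n : lengths) {
        hits += InReportedRegressionRange(n) ? 1 : 0;
    }
    return static_cast<double>(hits) / static_cast<double>(lengths.size());
}
```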
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

WORKAROUND
; memcpy.asm line 456
XmmCopySmall:
    bt __favor, __FAVOR_SMSTRG ; check if string copy should be used
    jc memcpy_repmovs
    movups xmm0, [rcx + rdx] ; load deferred bytes
    add rcx, 16
    sub r8, 16

; memset.asm line 93
    ; Check if strings should be used
    cmp r8, 128 ; is this a small set, size <= 128?
    ja XmmSet ; if large set, use XMM set
    bt __favor, __FAVOR_SMSTRG ; check if string set should be used
    jnc XmmSetSmall ; otherwise, use a 16-byte block set
    jmp memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s). Series: before, after.]
CODE SAMPLE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        memcpy(b, a, size);  // size is only known at run time
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (size_t i = 0; i < sizeof(a); ++i) {  // stay within the bounds of a
        a[i] = rand() % 256;
    }
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
mov r11,rcx
mov r10,rdx
cmp r8,10h
jbe 00000001400012F0
cmp r8,20h
jbe 00000001400012D0
sub rdx,rcx
jae 00000001400012A4
lea rax,[r8+r10]
cmp rcx,rax
jb 00000001400015D0
cmp r8,80h
jbe 0000000140001510
…
BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll memcpy
memcpy:
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0, Assess Performance (Extended), Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
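A minimal sketch of the "one stream per hardware thread" advice, assuming an x86 target with SSE: two arrays are updated in the same loop, but only one of them is written with a non-temporal store. The function name, parameters, and the trailing fence are choices made for this example rather than part of the presentation.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics; assumes an x86 target

// Element layout, as in the slide's sample: x, y, z, w, vx, vy, vz, vw.
// Hypothetical function: `a` and `b` must be 16-byte aligned and hold `len`
// floats, with `len` a multiple of 8.
inline void StepOneStream(float* a, float* b, std::size_t len, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < len; i += 8) {
        const __m128 p1 = _mm_add_ps(_mm_load_ps(&a[i]),
                                     _mm_mul_ps(_mm_load_ps(&a[i + 4]), vdt));
        _mm_stream_ps(&a[i], p1);  // the only non-temporal stream in the loop
        const __m128 p2 = _mm_add_ps(_mm_load_ps(&b[i]),
                                     _mm_mul_ps(_mm_load_ps(&b[i + 4]), vdt));
        _mm_store_ps(&b[i], p2);   // regular cached store
    }
    _mm_sfence();  // order the streaming stores before later stores
}
```

Both arrays receive the same arithmetic result; only the store type differs, which is the change the workaround in the code sample makes.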
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%).]
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x, y, z, w, vx, vy, vz, vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        step(dt);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0, Assess Performance (Extended), Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
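The "align and pad" advice can be checked at compile time and at run time. In this sketch the struct and helper names are hypothetical, and the 64-byte line size is taken from the alignas(64) used elsewhere in these slides.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineSize = 64;  // assumed line size, matching alignas(64)

// Adjacent array elements of this type share a cache line.
struct PackedCounter { unsigned long sum; };

// Each element of this type occupies a full cache line of its own.
struct alignas(kCacheLineSize) PaddedCounter { unsigned long sum; };

static_assert(sizeof(PaddedCounter) == kCacheLineSize, "padded to one full line");

// Hypothetical helper: true when two addresses fall in the same 64-byte line.
inline bool SameCacheLine(const void* p, const void* q) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLineSize ==
           reinterpret_cast<std::uintptr_t>(q) / kCacheLineSize;
}
```

Given per-thread arrays of each type, SameCacheLine(&packed[0], &packed[1]) reports true for a 64-byte-aligned packed array, while neighbouring PaddedCounter elements never share a line.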
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. Time (s) by binary: Before 365, After 47 (+679%).]
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltwindowshgtinclude ltchronogtinclude ltnumericgtinclude ltthreadgtusing namespace stdchronodefine NUM_ITER 2000000000int seed
if 0 4 bytes struct ThreadData unsigned long sum else 64 bytes struct alignas(64) ThreadData unsigned long sum endifDWORD WINAPI ThreadProcCallback(void param) ThreadData p = (ThreadData)paramsrand(seed)p-gtsum = 0for (int i = 0 i lt NUM_ITER i++)
p-gtsum += rand() 2return 0
73
int main(int argc char argv[]) seed = (argc gt 1) atoi(argv[1]) 3int numThreads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[numThreads]ThreadData a = new ThreadData[numThreads]high_resolution_clocktime_point t0 = high_resolution_clocknow()for (size_t i = 0 i lt numThreads ++i) threads[i] = CreateThread(NULL 0 ThreadProcCallback
(void)ampa[i] 0 NULL)WaitForMultipleObjects(numThreads threads TRUE INFINITE)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (ms) lfn 10000 time_spancount())for (size_t i = 0 i lt numThreads ++i)
printf(sum[llu] = lun i a[i]sum)delete[] adelete[] threadsreturn EXIT_SUCCESS
PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi*8],rax
add rbp,40h
inc rdi
cmp rdi,r14
jb 0000000140001180
BEFORE (BAD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi*8],rax
add rbp,4
inc rdi
cmp rdi,r14
jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
3RD GENERATION AMD RYZENtrade DESKTOP PROCESSOR
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
DISCLAIMER AND ATTRIBUTION
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359, gameplay 7.21B. Steam Beta Built Feb 15 2019, at 15:03:34 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
MICROARCHITECTURE
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations, yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen")/8GB DDR3-2133 ("Excavator")/8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the "Zen" architecture vs. "Piledriver" architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the "Zen" architecture vs. "Excavator" architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen")/8GB DDR3-2133 ("Excavator")/8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: "Zen" vs. "Piledriver" (31.5 vs. 20.7 | +52%), "Zen" vs. "Excavator" (31.5 vs. 19.2 | +64%). Cinebench R15 1t scores: "Zen" vs. "Piledriver" (139 vs. 79, both at 3.4G | +76%), "Zen" vs. "Excavator" (160 vs. 97.5, both at 4.0G | +64%). RZN-11
ZEN SMT DESIGN OVERVIEW
Family 17h added the AMD vendor-specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.
INSTRUCTION SET EVOLUTION
Feature columns, left to right (1 = supported, 0 = not supported): ADX, CLFLUSHOPT, RDSEED, SHA, SMAP, XGETBV, XSAVEC, XSAVES, AVX2, BMI2, MOVBE, RDRND, SMEP, FSGSBASE, XSAVEOPT, BMI, FMA, F16C, AES, AVX, OSXSAVE, PCLMULQDQ, SSE4.1, SSE4.2, XSAVE, SSSE3, CLZERO, FMA4, TBM, XOP
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | FEATURES
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 0 1 0 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 1 1 1 1 1 1 1 1 1 0 0 0 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0
2011 | 12h | | “Llano” | A8-3870 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
• Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS and RSQRTSS, which define relative error as
  • |Relative Error| ≤ 1.5 × 2^-12
FLOAT INSTRUCTIONS
[Figure: floating-point unit block diagram. Decode/Rename, 192-entry Retire Queue, 36-entry Scheduler, 160-entry Physical Register File, execution pipes MUL0, ADD0, MUL1 and ADD1, Load Convert Unit, Forwarding Muxes, Int to FP and FP to Int paths; 4 FP micro-op dispatch, 8 micro-op retire, 128-bit loads]
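Because the approximate reciprocal instructions are only specified up to a relative error bound, code that compares their results across machines is safer checking against that bound than checking for bit equality. The following is a minimal sketch of such a check; the helper names `kMaxRelError` and `withinApproxBound` are hypothetical and not part of any AMD or compiler API.

```cpp
#include <cassert>
#include <cmath>

// Relative error bound quoted on the slide for RCPPS/RCPSS/RSQRTPS/RSQRTSS:
// |relative error| <= 1.5 * 2^-12.
constexpr double kMaxRelError = 1.5 / 4096.0;

// Hypothetical helper: true if `approx` is an acceptable approximation of
// `exact` under that bound. `exact` must be non-zero.
inline bool withinApproxBound(double approx, double exact) {
    return std::fabs((approx - exact) / exact) <= kMaxRelError;
}

// Hypothetical helper: two vendors may return different values for the same
// input; both are acceptable as long as each one is within the bound.
inline bool bothAcceptable(double a, double b, double exact) {
    return withinApproxBound(a, exact) && withinApproxBound(b, exact);
}
```

A test that expects identical output from two processors can call `bothAcceptable` with a reference value computed in full precision instead of comparing the two approximations directly.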
• AMD vendor-specific instruction
• An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue → Store Commit → Write Combining Buffer → L2 → Data Fabric → Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
CLZERO INSTRUCTION
[Figure: load/store unit block diagram. AGU0 and AGU1, Load Queue, Store Queue, L1/L2 TLB, Micro-tags, Page Walker, L0 Pick, L1 Pick, TLB0/DAT0 and TLB1/DAT1 pipes, 32K Data Cache, Prefetch, Store Pipe Pick, STP, Store Commit, WCB, MAB; 32 bytes to/from L2; results go to EX and to FP]
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
CACHE LATENCY
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
Cache Latency in core clock cycles (less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
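The Sets column of the cache table follows from the other columns: sets = capacity / (ways × line size). The sketch below reproduces that arithmetic; the function names are hypothetical, and `setIndex` assumes plain modulo indexing on the line address, which is a simplification and not a statement about how the hardware actually indexes its caches.

```cpp
#include <cassert>
#include <cstdint>

// Number of sets in a set-associative cache.
// capacity and lineSize must use the same unit (bytes, or uops for the op cache).
constexpr std::uint64_t cacheSets(std::uint64_t capacity,
                                  std::uint64_t ways,
                                  std::uint64_t lineSize) {
    return capacity / (ways * lineSize);
}

// Simplified set index (assumption: modulo indexing on the line address).
constexpr std::uint64_t setIndex(std::uint64_t address,
                                 std::uint64_t sets,
                                 std::uint64_t lineSize) {
    return (address / lineSize) % sets;
}

constexpr std::uint64_t KB = 1024;
constexpr std::uint64_t MB = 1024 * KB;
```

Under that simplified indexing, two addresses 4 KB apart land in the same L1D set (64 sets × 64 B lines), which is why strides that are large powers of two can put pressure on the 8 ways of a single set.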
• ~64ns near memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM LOCAL DRAM
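[Figure: data fabric diagram, refill from local DRAM. Two CCX blocks, each attached through a CCM to the SDF Transport Layer; CS and UMC pairs connect to DDR4; CAKE connects to the off-chip IFIS or IFOP links]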
• Refill from other CCX cost may be similar to memory latency
REFILL FROM OTHER LOCAL CCX
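[Figure: data fabric diagram, refill from the other local CCX. Two CCX blocks, each attached through a CCM to the SDF Transport Layer; CS and UMC pairs connect to DDR4; CAKE connects to the off-chip IFIS or IFOP links]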
REFILL FROM REMOTE DIE DRAM
[Figure: data fabric diagram, refill from remote die DRAM. Two dies, each with two CCX blocks attached through CCMs to the SDF Transport Layer; CS and UMC pairs connect to DDR4; CAKE blocks link the dies over IFIS or IFOP]
• ~105ns far memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GB/s. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
AMD UPROF PROFILER
• Remote profiling
• Thread Concurrency
• Designed to support Multiple counters using the same event but different unit masks supported
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Embedded Excel chart data; the per-point values did not survive extraction. Charts: Cache Latency (core clock cycles, less is better); Use Best Practices With Spinlocks (elapsed time in s, less is better); Avoid Too Many Non-Temporal Streams (elapsed time in s, less is better); Avoid False Sharing (elapsed time in s, less is better); LAME 3.100 WAV to MP3 Encode (elapsed time in s by Configuration Platform, less is better); call memcpy benchmark (elapsed time in s by size, less is better, and B difference from A by size); D3D12Multithreading TitleThrottle=10000 (cpu time difference from NumContexts 1, by NumContexts, for Draws 1025 and Draws 10250)]
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
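Several of the metrics listed for this profile are ratios derived from raw counter values: IPC is retired instructions divided by CPU clocks, and a PTI metric is an event count normalized per thousand retired instructions. A minimal sketch of that arithmetic follows; the function names are hypothetical and are not taken from AMD uProf.

```cpp
#include <cassert>
#include <cstdint>

// Instructions per cycle from two raw counter values.
inline double ipc(std::uint64_t retiredInstructions, std::uint64_t cpuClocks) {
    return static_cast<double>(retiredInstructions) /
           static_cast<double>(cpuClocks);
}

// Events per thousand retired instructions (PTI).
inline double perThousandInstructions(std::uint64_t events,
                                      std::uint64_t retiredInstructions) {
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retiredInstructions);
}
```

Normalizing by retired instructions makes functions of very different sizes comparable, which is why the profile reports most events as PTI rather than as raw counts.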
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
rdpmc, SMN, in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
• Binary compiled after applying platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORMX64
LAME 3.100 WAV to MP3 Encode, elapsed time in seconds (less is better)
Configuration Platform | seconds | diff
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
  • The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
  • Windows 7 x86 does NOT require SSE2 & PrefetchW

0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
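The crash above comes from executing `pabsd`, an SSSE3 instruction, on a processor that does not enumerate SSSE3. A minimal sketch of the guard is shown below. It decodes register values that the caller has already obtained from CPUID (how to execute CPUID is compiler specific and is left out), and the names `CpuidRegs`, `hasSSSE3`, `hasFMA4` and `chooseAbsPath` are hypothetical. The bit positions used are SSSE3 in CPUID Fn0000_0001 ECX bit 9 and FMA4 in CPUID Fn8000_0001 ECX bit 16.

```cpp
#include <cassert>
#include <cstdint>

// Register values returned by one CPUID leaf (obtained by the caller).
struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

constexpr bool testBit(std::uint32_t value, unsigned index) {
    return ((value >> index) & 1u) != 0;
}

// CPUID Fn0000_0001, ECX bit 9: SSSE3 (required for pabsd).
constexpr bool hasSSSE3(const CpuidRegs& fn00000001) {
    return testBit(fn00000001.ecx, 9);
}

// CPUID Fn8000_0001, ECX bit 16: FMA4.
constexpr bool hasFMA4(const CpuidRegs& fn80000001) {
    return testBit(fn80000001.ecx, 16);
}

// Pick a code path once at startup instead of assuming the instruction exists.
enum class AbsPath { Scalar, Ssse3 };

constexpr AbsPath chooseAbsPath(const CpuidRegs& fn00000001) {
    return hasSSSE3(fn00000001) ? AbsPath::Ssse3 : AbsPath::Scalar;
}
```

AVX-class features additionally need the operating system to enable the extended register state, so a complete check for those would also consult OSXSAVE and XGETBV; that part is not shown here.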
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        }
        else {
            count = cores;
        }
    }
    return count;
}
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
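The policy in `getDefaultThreadCount` can be separated from the platform queries so it is easy to unit test. The sketch below is one way to do that; `defaultThreadCount` is a hypothetical helper that takes the vendor string, CPUID family and the two counts as inputs instead of querying them, and it is not the GPUOpen sample itself.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical pure-function form of the thread count policy:
//  - non-AMD vendors: use all logical processors
//  - AMD family 15h ("Bulldozer", not an SMT design): use all logical processors
//  - other AMD families: use physical cores
inline unsigned defaultThreadCount(const char* vendor, unsigned family,
                                   unsigned cores, unsigned logical) {
    unsigned count = logical;
    if (std::strcmp(vendor, "AuthenticAMD") == 0 && family != 0x15) {
        count = cores;
    }
    return count;
}
```

For a Ryzen™ 7 2700X (family 17h, 8 cores, 16 logical processors) this returns 8. Treat the result as a starting default only; the slide's own advice is to profile the application to find the ideal thread count.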
Incorrect enum.exe output for Ryzen™ 7 2700X:

Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64;                     // logical processors
DWORD_PTR p = 0xffffffffffffffff;   // process mask
int b = 32;                         // bit index
#if 0 // BAD
    int x = p;                      // signed narrowing
    int count = 0;
    // while (x != 0)               // never zero
    while (x > 0)                   // false
    {
        count += (x & 1);
        x >>= 1;                    // fill bits with sign
    }
    printf("popcnt = %i\n", count);                        // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));       // undefined; 0
#else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0)
    {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));    // 0x0000000100000000
#endif
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
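The same rules can be written portably with fixed-width unsigned types, which removes both the narrowing and the signed-shift hazards. This is a minimal sketch using `std::uint64_t` in place of `DWORD_PTR`; the helper names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Count the logical processors in a 64-bit affinity mask.
// The mask stays unsigned and 64 bits wide, so the loop terminates and
// right shifts fill with zeros.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Mask with only bit `index` set; index must be below 64.
// The shifted constant is 64 bits wide, unlike the int literal 1.
inline std::uint64_t processorBit(unsigned index) {
    return std::uint64_t{1} << index;
}

inline bool isProcessorInMask(std::uint64_t mask, unsigned index) {
    return (mask & processorBit(index)) != 0;
}
```

With a full 64-processor mask, `countProcessors` returns 64 and `processorBit(32)` yields 0x0000000100000000, matching the GOOD branch of the slide code.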
// char buffer[0x1000]; // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        // The first call reports the required size in len
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
            // ...
        }
        free(buffer);
    }
}
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
-500
50100150200250300350400
0 4 8 12 16
cpu
time
di
ffere
nce
from
N
umC
onte
xts
1
NumContexts
D3D12MultithreadingTitleThrottle=10000
(higher is better)Draws1025 Draws10250
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201944
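The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small function. The function name and the clamp to a minimum of one context are assumptions added for illustration; the slide states only the min() expression.

```cpp
#include <algorithm>
#include <cassert>

// Sketch of the slide's rule: NumContexts = min(cores - 1, Draws / 250).
// Assumption (not on the slide): clamp to at least 1 so that a small draw
// count or a single-core system still gets one context.
int recommend_num_contexts(int cores, int draws) {
    assert(cores >= 1 && draws >= 0);
    const int by_cores = cores - 1;    // leave one core for the main thread
    const int by_draws = draws / 250;  // integer division
    return std::max(1, std::min(by_cores, by_draws));
}
```

For the 8-core test system, `recommend_num_contexts(8, 1025)` gives 4 and `recommend_num_contexts(8, 10250)` gives 7. Note that `std::thread::hardware_concurrency()` reports logical processors rather than cores, so the core count would need to come from a topology query.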
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions
    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock xchg cmp
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
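For readers outside MSVC, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative portable variant, not the code from the slide: the class name `SpinLock`, the helper `count_with_lock`, and the fallback to `std::this_thread::yield()` on non-x86 targets are assumptions, and the unlock here is a release store rather than the slide's `_InterlockedExchange`.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()  // assumption: no pause available
#endif

// Aligned to 64 bytes so the lock does not share a cache line with other data.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is taken.
            while (taken_.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: one atomic exchange once the lock looks free.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> taken_{false};
};

// Hypothetical demo: several threads increment a shared counter under the lock.
long count_with_lock(int num_threads, int iterations) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& w : workers) {
        w.join();
    }
    return counter;
}
```

The inner load loop keeps waiting threads reading a shared copy of the line, so the atomic exchange is attempted only when the lock appears free.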
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks (less is better). Y axis: Elapsed Time (s). X axis: binary.]
binary | Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0  cmp   dword ptr [?gLock@@3IA],1
                      je    00000001400010D9
                      mov   eax,1
                      xchg  eax,dword ptr [?gLock@@3IA]
                      cmp   eax,1
                      jne   00000001400010DD
    00000001400010D9  pause
                      jmp   00000001400010C0
    00000001400010DD
BEFORE (BAD)
    00000001400010C1  xor   eax,eax
                      lock cmpxchg dword ptr [?gLock@@3IA],ecx
                      cmp   eax,ecx
                      je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix (the xchg rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\" to the application folder

    ; memcpy.asm line 456
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG  ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]        ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

    ; memset.asm line 93
    ; Check if strings should be used
        cmp     r8, 128                  ; is this a small set, size <= 128?
        ja      XmmSet                   ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG  ; check if string set should be used
        jnc     XmmSetSmall              ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s) (0 to 35). Series: before, after.]
CODE SAMPLE
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);   // size is only known at run time
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Bound is < sizeof(a); the slide's <= wrote one byte past the array.
        for (int i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Y axis: Elapsed Time (s). X axis: binary.]
binary | Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atof, atoll, rand, srand
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups   xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
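The "align and pad" advice can be sketched portably with `std::thread`. This is an illustrative example rather than the deck's Windows sample: the names `PaddedCounter`, `kThreads`, and `sum_in_parallel` are hypothetical, and the 64-byte line size is an assumption that matches the `alignas(64)` used elsewhere in these slides.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Each counter occupies a full cache line, so two threads never write to the
// same line when they update neighbouring array elements.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t sum = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "padding should round the struct up to one cache line");

constexpr int kThreads = 4;  // hypothetical fixed thread count for the sketch

// Every thread accumulates into its own padded slot; the slots are combined
// only after all threads have joined.
std::uint64_t sum_in_parallel(std::uint64_t iterations_per_thread) {
    PaddedCounter counters[kThreads];
    std::thread workers[kThreads];
    for (int t = 0; t < kThreads; ++t) {
        workers[t] = std::thread([&counters, t, iterations_per_thread] {
            for (std::uint64_t i = 0; i < iterations_per_thread; ++i) {
                counters[t].sum += 1;
            }
        });
    }
    std::uint64_t total = 0;
    for (int t = 0; t < kThreads; ++t) {
        workers[t].join();
        total += counters[t].sum;
    }
    return total;
}
```

Removing `alignas(kCacheLine)` packs the four counters into one 32-byte span, which is the layout the "before" case on these slides measures.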
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. Y axis: Time (s). X axis: binary.]
binary | Time (s)
Before | 365
After | 47 (+679%)
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <stdio.h>    // printf
    #include <stdlib.h>   // atoi, rand, srand
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
PROFILING AFTER
Few refills from CCX
[Profile columns on both slides: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180  mov   qword ptr [rsp+28h],rsi
                      lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
                      mov   r9,rbp
                      mov   dword ptr [rsp+20h],esi
                      xor   edx,edx
                      xor   ecx,ecx
                      call  qword ptr [__imp_CreateThread]
                      mov   qword ptr [r15+rdi*8],rax
                      add   rbp,40h
                      inc   rdi
                      cmp   rdi,r14
                      jb    0000000140001180
BEFORE (BAD)
    0000000140001180  mov   qword ptr [rsp+28h],rsi
                      lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
                      mov   r9,rbp
                      mov   dword ptr [rsp+20h],esi
                      xor   edx,edx
                      xor   ecx,ecx
                      call  qword ptr [__imp_CreateThread]
                      mov   qword ptr [r12+rdi*8],rax
                      add   rbp,4
                      inc   rdi
                      cmp   rdi,r14
                      jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics 10, liquid good, Boralus Harbor)
DOTA 2 version 3359, gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359, gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
ZEN SMT DESIGN OVERVIEW
• All structures available in 1T mode
• Front End Queues are round robin with priority overrides
• High throughput from SMT
• AMD Ryzen™ achieved a greater than 52% increase in IPC over previous generation AMD processors
  • Testing by AMD Performance labs. PC manufacturers may vary configurations yielding different results. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen")/8GB DDR3-2133 ("Excavator")/8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint06) and Windows® 10 x64 RS1 (Cinebench R15). Updated Feb 28, 2017: Generational IPC uplift for the "Zen" architecture vs. "Piledriver" architecture is +52% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. Generational IPC uplift for the "Zen" architecture vs. "Excavator" architecture is +64% as measured with Cinebench R15 1T, and also +64% with an estimated SPECint_base2006 score compiled with GCC 4.6 -O2 at a fixed 3.4GHz. System configs: AMD reference motherboard(s), AMD Radeon™ R9 290X GPU, 8GB DDR4-2667 ("Zen")/8GB DDR3-2133 ("Excavator")/8GB DDR3-1866 ("Piledriver"), Ubuntu Linux 16.x (SPECint_base2006 estimate) and Windows® 10 x64 RS1 (Cinebench R15). SPECint_base2006 estimates: "Zen" vs. "Piledriver" (31.5 vs 20.7 | +52%), "Zen" vs. "Excavator" (31.5 vs 19.2 | +64%). Cinebench R15 1t scores: "Zen" vs. "Piledriver" (139 vs 79, both at 3.4G | +76%), "Zen" vs. "Excavator" (160 vs 97.5, both at 4.0G | +64%). RZN-11
Family 17h added the AMD vendor-specific instruction CLZERO and deprecated support for FMA4, TBM, & XOP.
INSTRUCTION SET EVOLUTION
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE4.1 | SSE4.2 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | “Summit Ridge”, “Pinnacle Ridge” | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | “Carrizo”, “Bristol Ridge” | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | “Kaveri”, “Godavari” | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | “Vishera” | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | “Zambezi” | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | “Kabini” | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | “Ontario” | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | “Llano” | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | “Greyhound” | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data
  • Avoid mixing legacy SIMD and AVX instructions
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS, and RSQRTSS, which define relative error as
  • |Relative Error| <= 1.5 × 2^-12
FLOAT INSTRUCTIONS
[Diagram: floating-point unit. Labels: Decode/Rename, 192 Entry Retire Queue, 36 Entry Scheduler, 160 entry Physical Register File, MUL0, ADD0, MUL1, ADD1, Load Convert Unit, Forwarding Muxes, 4 fp micro-op dispatch, 8 micro-op retire, 128 bit loads, Int to FP, FP to Int]
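The documented bound for RCPPS, RCPSS, RSQRTPS, and RSQRTSS is |relative error| <= 1.5 × 2^-12, so a test that compares results across processors bit for bit can fail even when both results are in spec. The following is a minimal sketch of a tolerance-based comparison; the helper names (`withinApproxBound`, `approxResultsAgree`) are hypothetical and not part of any AMD or compiler API.

```cpp
#include <cassert>
#include <cmath>

// Documented bound for the approximate reciprocal instructions:
// |relative error| <= 1.5 * 2^-12.
constexpr double kApproxMaxRelError = 1.5 / 4096.0;

// Relative error of an approximation against the exact value
// (exact must be non-zero).
inline double relativeError(double approx, double exact) {
    return std::fabs((approx - exact) / exact);
}

// True if a single approximate result lies within the documented bound.
inline bool withinApproxBound(double approx, double exact) {
    return relativeError(approx, exact) <= kApproxMaxRelError;
}

// Two in-spec results may err in opposite directions, so compare them
// against each other with twice the bound (hypothetical helper for
// cross-vendor regression tests).
inline bool approxResultsAgree(double a, double b) {
    const double scale = std::fmax(std::fabs(a), std::fabs(b));
    return std::fabs(a - b) <= 2.0 * kApproxMaxRelError * scale;
}
```

A regression test would call `approxResultsAgree(resultOnVendorA, resultOnVendorB)` instead of comparing the two values for equality.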
• AMD vendor-specific instruction
• An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
CLZERO INSTRUCTION
[Diagram: load/store unit. Labels: Load Queue, Store Queue, L1/L2 TLB, Micro-tags, Page Walker, L0 Pick, L1 Pick, TLB0, DAT0, TLB1, DAT1, 32K Data Cache, Prefetch, 32 bytes to/from L2, AGU0, AGU1, To Ex, To FP, WCB, To L2, MAB, Store Pipe Pick, STP, Store Commit]
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
CACHE LATENCY
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
L1D L2U L3UAMD Ryzentrade 7
1800X 4 17 40
AMD Ryzentrade 7 2700X 4 12 35
0
10
20
30
40
core
clo
ck c
ycle
s
Cache Latency(less is better)
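The capacity column of the cache table follows from sets × ways × line size. The sketch below checks that arithmetic at compile time for each level. The `setIndex` helper assumes a plain modulo index on line-granular addresses, which is an illustrative simplification and not a statement about how the hardware actually indexes its caches.

```cpp
#include <cassert>
#include <cstdint>

struct CacheGeometry {
    std::uint64_t sets;
    std::uint64_t ways;
    std::uint64_t lineSize;  // bytes (micro-ops for the op cache)
};

// Capacity = sets * ways * line size.
constexpr std::uint64_t capacity(const CacheGeometry& g) {
    return g.sets * g.ways * g.lineSize;
}

// Geometry values taken from the table above.
constexpr CacheGeometry kOpCache{32, 8, 8};
constexpr CacheGeometry kL1I{256, 4, 64};
constexpr CacheGeometry kL1D{64, 8, 64};
constexpr CacheGeometry kL2{1024, 8, 64};
constexpr CacheGeometry kL3{8192, 16, 64};

static_assert(capacity(kOpCache) == 2048, "op cache: 2 K micro-ops");
static_assert(capacity(kL1I) == 64 * 1024, "L1I: 64 KB");
static_assert(capacity(kL1D) == 32 * 1024, "L1D: 32 KB");
static_assert(capacity(kL2) == 512 * 1024, "L2: 512 KB");
static_assert(capacity(kL3) == 8 * 1024 * 1024, "L3: 8 MB");

// ASSUMPTION: simple modulo set index; real hardware may hash addresses.
constexpr std::uint64_t setIndex(const CacheGeometry& g, std::uint64_t address) {
    return (address / g.lineSize) % g.sets;
}
```

Under this simplified model, addresses that are 4 KB apart (64 sets × 64 B) land in the same L1D set, so more than 8 such lines would compete for its 8 ways.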
• ~64ns near memory
  • RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 39836), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM LOCAL DRAM
[Diagram: refill path from local DRAM. Labels: DDR4, UMC, CS, CCM, CCX, SDF Transport Layer, CAKE, IFIS or IFOP (off-chip)]
• Refill from other CCX cost may be similar to memory latency
REFILL FROM OTHER LOCAL CCX
[Diagram: refill path from the other local CCX. Labels: DDR4, UMC, CS, CCM, CCX, SDF Transport Layer, CAKE, IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
[Diagram: refill path from remote die DRAM across two dies. Labels per die: DDR4, UMC, CS, CCM, CCX, SDF Transport Layer, CAKE, IFIS or IFOP (off-chip)]
• ~105ns far memory
AMD μPROF PROFILER
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable, Wolfenstein 2 SAS Machine Finale 1080p High, novsync
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Embedded chart data (Excel). Recoverable chart titles, axes, and headline values:]
[Chart: World of Warcraft v8.1 DX12, Average FPS, gxMT Off vs. gxMT On, +36% (higher is better)]
[Chart: Audacity LAME MP3 Encode x86, Elapsed Time (s), v3.99 non-SSE2 vs. v3.100 SSE2, +107% (lower is better)]
[Chart: DOTA 2 version 3359 gameplay 7.21B, Average FPS, DX9 vs. DX11 (+9%) vs. VK Beta (+12%) (higher is better)]
[Chart: Cache Latency, core clock cycles, AMD Ryzen 7 1800X (L1D 4, L2U 17, L3U 40) vs. AMD Ryzen 7 2700X (L1D 4, L2U 12, L3U 35) (less is better)]
[Chart: Use Best Practices With Spinlocks, Elapsed Time (s), Before vs. After binary, +1018% (less is better)]
[Chart: Avoid Too Many Non-Temporal Streams, Elapsed Time (s), Before vs. After binary, +123% (less is better)]
[Chart: Avoid False Sharing, Elapsed Time (s), Before vs. After binary, +679% (less is better)]
[Chart: LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) by Configuration Platform: Release x86 52, ReleaseSSE2 x86 44 (+18%), Release x64 43 (+21%), ReleaseSSE2 x64 39 (+33%) (less is better)]
[Chart: call memcpy benchmark, Elapsed Time (s) vs. size (1 to 160), before and after (less is better)]
[Chart: call memcpy benchmark, B difference from A vs. size (1 to 160)]
[Chart: D3D12Multithreading, TitleThrottle=10000, cpu time difference from NumContexts 1 vs. NumContexts (1 to 16), series Draws 1025 and Draws 10250]
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
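Several of the metrics listed for this profile are ratios of raw counters: IPC is retired instructions divided by CPU clocks, and a "PTI" metric is an event count normalized per thousand retired instructions. A minimal sketch of that arithmetic follows; the function names are hypothetical and are not AMD uProf APIs.

```cpp
#include <cassert>
#include <cstdint>

// Instructions per clock: retired instructions / CPU clocks.
inline double ipc(std::uint64_t retiredInstructions, std::uint64_t cpuClocks) {
    return cpuClocks == 0
               ? 0.0
               : static_cast<double>(retiredInstructions) /
                     static_cast<double>(cpuClocks);
}

// Events per thousand retired instructions (PTI).
inline double perThousandInstructions(std::uint64_t eventCount,
                                      std::uint64_t retiredInstructions) {
    return retiredInstructions == 0
               ? 0.0
               : 1000.0 * static_cast<double>(eventCount) /
                     static_cast<double>(retiredInstructions);
}
```

Normalizing by retired instructions makes functions of very different sizes comparable, which is why thresholds on these metrics are usually applied only to the top functions in a profile.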
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer, & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
[Diagram labels for counter access: rdpmc; SMN in/out]
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
• Binary compiled after applying platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORM X64
[Chart: LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) by Configuration Platform (less is better)]
Configuration Platform | seconds | diff
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW
0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
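As an illustration of testing CPUID before calling instructions, the sketch below decodes feature bits from raw CPUID register values. It assumes the caller has already executed CPUID (for example with `__cpuid` from `<intrin.h>` on MSVC or `__get_cpuid` from `<cpuid.h>` on GCC/Clang) and, for AVX, read XCR0 with `_xgetbv(0)`. The struct and function names are hypothetical. The bit positions are the architectural ones: Fn0000_0001 ECX bit 9 (SSSE3), bit 27 (OSXSAVE), bit 28 (AVX), and Fn8000_0001 ECX bit 16 (FMA4).

```cpp
#include <cassert>
#include <cstdint>

// Raw register values returned by one CPUID leaf.
struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

constexpr bool testBit(std::uint32_t value, unsigned bit) {
    return ((value >> bit) & 1u) != 0;
}

// leaf1 = CPUID Fn0000_0001.
inline bool hasSsse3(const CpuidRegs& leaf1) { return testBit(leaf1.ecx, 9); }

// ext1 = CPUID Fn8000_0001.
inline bool hasFma4(const CpuidRegs& ext1) { return testBit(ext1.ecx, 16); }

// AVX needs three checks: the CPU reports AVX, the OS has enabled XSAVE
// (OSXSAVE), and XCR0 shows that the OS saves both XMM (bit 1) and
// YMM (bit 2) state. Only read XCR0 when OSXSAVE is set.
inline bool canUseAvx(const CpuidRegs& leaf1, std::uint64_t xcr0) {
    const bool osxsave = testBit(leaf1.ecx, 27);
    const bool avx = testBit(leaf1.ecx, 28);
    return osxsave && avx && ((xcr0 & 0x6u) == 0x6u);
}
```

A dispatcher would select the SSSE3 or AVX code path only when the matching check returns true and fall back to an SSE2 path otherwise.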
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is
// not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        } else {
            count = cores;
        }
    }
    return count;
}
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores as logical processors 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64;                      // logical processors
DWORD_PTR p = 0xffffffffffffffff;    // process mask
int b = 32;                          // bit index
#if 0 // BAD
int x = p;                           // signed narrowing
int count = 0;
// while (x != 0) {                  // never zero
while (x > 0) {                      // false
    count += (x & 1);
    x >>= 1;                         // fill bits with sign
}
printf("popcnt = %i\n", count);      // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
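The same rules can be expressed portably with fixed-width unsigned types, which sidesteps both the sign-extending right shift and the undefined left shift. This is a minimal sketch rather than the sample's actual code; `countProcessors` and `processorBit` are hypothetical helper names.

```cpp
#include <cassert>
#include <cstdint>

// Count set bits in a 64-bit affinity mask. The mask stays unsigned and
// 64 bits wide, so the right shift fills with zeros and the loop ends.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Mask bit for one logical processor (index must be below 64). The shifted
// operand is a 64-bit unsigned 1, so indices 32..63 are well defined.
inline std::uint64_t processorBit(unsigned index) {
    return std::uint64_t{1} << index;
}
```

With a full process mask, `countProcessors(0xffffffffffffffffULL)` counts all 64 logical processors of a group, where the narrowed `int` version reports 0.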
// char buffer[0x1000];              // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll,
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
        &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                &len)) {
            // buffer now holds len bytes of processor information
        }
    }
}
free(buffer);
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
[Chart: D3D12Multithreading, TitleThrottle=10000; CPU time difference from NumContexts 1 versus NumContexts (0 to 16), series Draws 1025 and Draws 10250; higher is better]
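The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small helper. This is a minimal sketch, not code from the D3D12Multithreading sample: the function name is hypothetical, and both the integer division and the clamp to at least one context are assumptions added here.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper for the slide's heuristic:
//   NumContexts = min(cores - 1, Draws / 250)
// Assumptions: integer division, and at least one context is always used.
inline int recommended_num_contexts(int cores, int draws) {
    const int by_cores = cores - 1;    // leave one core for the main thread
    const int by_draws = draws / 250;  // roughly 250 draws per context
    return std::max(1, std::min(by_cores, by_draws));
}
```

For the 8-core test system, this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.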
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions
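namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    // The MSVC interlocked intrinsics take volatile long*, so the
    // 32-bit lock word is cast at each call.
#if 0 // BAD: every spin iteration issues lock cmpxchg
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                   (volatile long*)pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
        }
    }
#else // GOOD: test, then test-and-set with xchg, pausing while taken
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *(volatile unsigned*)pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange((volatile long*)pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange((volatile long*)pl, LOCK_IS_FREE);
    }
}

alignas(64) MyLock::LOCK gLock;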
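For readers outside MSVC, the same test and test-and-set idea can be sketched with std::atomic. This is a hedged, portable counterpart, not the presenter's code: the type name, the demo function, and the yield fallback for non-x86 targets are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()  // assumed fallback off x86
#endif

// Hypothetical portable counterpart of the GOOD lock.
struct alignas(64) TtasLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load, pausing while the lock is taken.
            while (state.load(std::memory_order_relaxed) == 1) {
                SPIN_PAUSE();
            }
            // Test-and-set: one atomic exchange once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};
static_assert(alignof(TtasLock) == 64, "lock should own its cache line");

// Demo: each thread adds `iters` to a shared counter under the lock.
inline unsigned long locked_increment_demo(unsigned threads, unsigned long iters) {
    TtasLock lock;
    unsigned long counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (unsigned long i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

One design difference from the slide: unlock here is a release store rather than an exchange, which is a choice of this sketch and not something the slides prescribe.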
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks; Elapsed Time (s) by binary, less is better. Before: 304, After: 27 (+1018%)]
CODE SAMPLE
// Uses the MyLock namespace and gLock from the spinlock listing.
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <algorithm>
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        // LEN independent 4x4 matrix multiply-accumulates
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp         dword ptr [?gLock@@3IA],1
    je          00000001400010D9
    mov         eax,1
    xchg        eax,dword ptr [?gLock@@3IA]
    cmp         eax,1
    jne         00000001400010DD
00000001400010D9:
    pause
    jmp         00000001400010C0
00000001400010DD:
BEFORE (BAD)
00000001400010C1:
    xor         eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp         eax,ecx
    je          00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\" to the application folder

; memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]         ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

; memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                   ; is this a small set, size <= 128?
    ja      XmmSet                    ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark; Elapsed Time (s) versus size (0 to 160), series before and after; less is better]
CODE SAMPLE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

// Times `steps` calls to memcpy with a size that is unknown at compile time.
void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        memcpy(b, a, size);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // i < sizeof(a): the slide's "<=" would write one byte past the array
    for (size_t i = 0; i < sizeof(a); ++i) {
        a[i] = rand() % 256;
    }
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
    mov         r11,rcx
    mov         r10,rdx
    cmp         r8,10h
    jbe         00000001400012F0
    cmp         r8,20h
    jbe         00000001400012D0
    sub         rdx,rcx
    jae         00000001400012A4
    lea         rax,[r8+r10]
    cmp         rcx,rax
    jb          00000001400015D0
    cmp         r8,80h
    jbe         0000000140001510
    …
BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
memcpy:
    jmp         qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams; Elapsed Time (s) by binary, less is better. Before: 104, After: 47 (+123%)]
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // layout per 8 floats: x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);   // first non-temporal stream
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second interleaved non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        step(dt);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps       xmm0,xmm1
    addps       xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps     xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups      xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps       xmm0,xmm1
    addps       xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups      xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps       xmm0,xmm1
    addps       xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps     xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups      xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps       xmm0,xmm1
    addps       xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps     xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
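The "align and pad" advice can be sketched portably with std::thread. This is a minimal illustration under stated assumptions: the type and function names are hypothetical, 64 bytes is taken as the cache-line size (matching the alignas(64) used elsewhere in these slides), and C++17 is assumed so that the vector allocation honors the alignment. It shows the layout only and does not measure timing.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread slot: alignas(64) pads each counter to its own
// cache line, so neighbouring threads do not write to the same line.
struct alignas(64) PaddedCounter {
    unsigned long value = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "one slot per 64-byte line");
static_assert(alignof(PaddedCounter) == 64, "slot starts on a line boundary");

// Each thread increments only its own padded slot.
inline std::vector<unsigned long> count_per_thread(unsigned threads,
                                                   unsigned long iters) {
    std::vector<PaddedCounter> slots(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&slots, t, iters] {
            for (unsigned long i = 0; i < iters; ++i) {
                slots[t].value += 1;
            }
        });
    }
    for (auto& th : pool) th.join();

    std::vector<unsigned long> out;
    for (const auto& s : slots) out.push_back(s.value);
    return out;
}
```

Without the alignas(64), sixteen 4-byte or 8-byte counters would share one or two cache lines, which is the case this slide warns about.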
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing; Time (s) by binary. Before: 365, After: 47 (+679%)]
CODE SAMPLE
#include <stdio.h>
#include <stdlib.h>
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: adjacent ThreadData share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: each ThreadData owns a cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms) %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov         qword ptr [rsp+28h],rsi
    lea         r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov         r9,rbp
    mov         dword ptr [rsp+20h],esi
    xor         edx,edx
    xor         ecx,ecx
    call        qword ptr [__imp_CreateThread]
    mov         qword ptr [r15+rdi*8],rax
    add         rbp,40h
    inc         rdi
    cmp         rdi,r14
    jb          0000000140001180
BEFORE (BAD)
0000000140001180:
    mov         qword ptr [rsp+28h],rsi
    lea         r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov         r9,rbp
    mov         dword ptr [rsp+20h],esi
    xor         edx,edx
    xor         ecx,ecx
    call        qword ptr [__imp_CreateThread]
    mov         qword ptr [r12+rdi*8],rax
    add         rbp,4
    inc         rdi
    cmp         rdi,r14
    jb          0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics 10, liquid good, Boralus Harbor)
DOTA 2 version 3359, gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359, gameplay 7.21B. Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
INSTRUCTION SET EVOLUTION
Family 17h added the AMD vendor specific instruction CLZERO and deprecated support for FMA4, TBM & XOP.
YEAR | FAMILY | MODELS | PRODUCT FAMILY | EXAMPLE PRODUCT | ADX | CLFLUSHOPT | RDSEED | SHA | SMAP | XGETBV | XSAVEC | XSAVES | AVX2 | BMI2 | MOVBE | RDRND | SMEP | FSGSBASE | XSAVEOPT | BMI | FMA | F16C | AES | AVX | OSXSAVE | PCLMULQDQ | SSE4.1 | SSE4.2 | XSAVE | SSSE3 | CLZERO | FMA4 | TBM | XOP
2017 | 17h | 00h-0Fh | "Summit Ridge", "Pinnacle Ridge" | Ryzen™ 7 2700X | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0
2015 | 15h | 60h-6Fh | "Carrizo", "Bristol Ridge" | A12-9800 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2014 | 15h | 30h-3Fh | "Kaveri", "Godavari" | A10-7890K | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2012 | 15h | 00h-0Fh | "Vishera" | FX-8370 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 1
2011 | 15h | 00h-0Fh | "Zambezi" | FX-8150 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 0 | 1
2013 | 16h | 00h-0Fh | "Kabini" | A6-1450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0
2011 | 14h | 00h-0Fh | "Ontario" | E-450 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0
2011 | 12h | | "Llano" | A8-3870 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
2009 | 10h | | "Greyhound" | Phenom II X4 955 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data.
  • Avoid mixing legacy SIMD and AVX instructions.
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction.
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction.
• Results from x86 processors from different vendors may not match exactly for the instructions RCPPS, RCPSS, RSQRTPS, and RSQRTSS, which define relative error as:
  • |Relative Error| <= 1.5 * 2^-12
[Figure: floating-point unit block diagram. Labels: Decode/Rename; 192-entry Retire Queue; 36-entry Scheduler; 160-entry Physical Register File; Forwarding Muxes; pipes MUL0, ADD0, MUL1, ADD1; Load Convert Unit; 128-bit loads; Int to FP; FP to Int; 4 FP micro-op dispatch; 8 micro-op retire.]
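Because RCPPS, RCPSS, RSQRTPS, and RSQRTSS only guarantee a relative error of at most 1.5 * 2^-12, code that validates their results is usually safer comparing against a tolerance than expecting bit-identical output across vendors. A minimal sketch of such a check; the function and constant names are hypothetical and not from the talk:

```cpp
#include <cmath>

// Architectural bound on relative error for the reciprocal and
// reciprocal-square-root approximation instructions: 1.5 * 2^-12.
constexpr double kMaxRelativeError = 1.5 / 4096.0;

// Returns true if 'approx' is within the architectural bound of the
// exactly computed reference value 'exact'.
inline bool withinApproxBound(double approx, double exact) {
    if (exact == 0.0) {
        return approx == 0.0;  // relative error is undefined at zero
    }
    return std::fabs((approx - exact) / exact) <= kMaxRelativeError;
}
```

A test suite that records golden values from one processor can use a check like this instead of an equality comparison, so it keeps passing on another vendor's approximation hardware.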
CLZERO INSTRUCTION
• AMD vendor-specific instruction.
• An atomic MOVNT (streaming store) type instruction that writes an entire 64 B cache line full of zeros to memory and clears poisoned status.
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory.
  • Example: kill a corrupt user process but keep the system running.
• Execution path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory.
• Is non-cacheable, unlike the PowerPC DCBZ instruction.
• Use memset rather than the CLZERO intrinsic to quickly zero memory.
[Figure: load/store unit block diagram. Labels: Load Queue; Store Queue; L1/L2 TLB; Micro-tags; Page Walker; L0 Pick; L1 Pick; TLB0; DAT0; TLB1; DAT1; 32K Data Cache; Prefetch; 32 bytes to/from L2; AGU0; AGU1; To Ex; To FP; WCB; To L2; MAB; Store Pipe Pick; STP; Store Commit.]
CACHE LATENCY
• Cache line size is 64 bytes.
  • 2 CPU clock cycles to move a single cache line.
• L2 is inclusive of L1.
  • Lines filled into L1 are also filled into L2.
• L3 is filled from L2 victims of all 4 cores within a CCX.
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions.

Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks

Cache Latency (core clock cycles, less is better) | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
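The capacity, sets, ways, and line size columns of the cache table are related by capacity = sets * ways * line size. The sketch below checks that relationship for the listed levels and shows a simplified set-index calculation; the helper names are hypothetical, and the plain modulo indexing is an assumption, since real hardware may hash address bits.

```cpp
#include <cstdint>

// Capacity in bytes of a set-associative cache: sets * ways * line size.
constexpr std::uint64_t cacheCapacityBytes(std::uint64_t sets,
                                           std::uint64_t ways,
                                           std::uint64_t lineBytes) {
    return sets * ways * lineBytes;
}

// Set index for an address, ASSUMING simple modulo indexing on the
// line-granular address (an illustration, not documented AMD behavior).
constexpr std::uint64_t setIndex(std::uint64_t address,
                                 std::uint64_t sets,
                                 std::uint64_t lineBytes) {
    return (address / lineBytes) % sets;
}

// The rows of the table are self-consistent:
static_assert(cacheCapacityBytes(256, 4, 64) == 64u * 1024u, "L1I is 64 KB");
static_assert(cacheCapacityBytes(64, 8, 64) == 32u * 1024u, "L1D is 32 KB");
static_assert(cacheCapacityBytes(1024, 8, 64) == 512u * 1024u, "L2 is 512 KB");
static_assert(cacheCapacityBytes(8192, 16, 64) == 8u * 1024u * 1024u, "L3 is 8 MB");
```

Under the modulo assumption, addresses 4 KB apart land in the same L1D set (64 sets * 64 B), so more than 8 hot lines at that stride would compete for the 8 ways of one set.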
REFILL FROM LOCAL DRAM
• ~64 ns near memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64 ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105 ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Figure: data path diagram. Labels: DDR4, UMC, CS, CCM, CCX (two of each); SDF Transport Layer; CAKE; IFIS or IFOP; off-chip.]
REFILL FROM OTHER LOCAL CCX
• The cost of a refill from the other CCX may be similar to memory latency.
[Figure: data path diagram. Labels: DDR4, UMC, CS, CCM, CCX (two of each); SDF Transport Layer; CAKE; IFIS or IFOP; off-chip.]
REFILL FROM REMOTE DIE DRAM
• ~105 ns far memory
• See footnote RP2-22 under REFILL FROM LOCAL DRAM for the latency measurements and the test configuration.
[Figure: data path diagram spanning two dies. Labels per die: DDR4, UMC, CS, CCM, CCX (two of each); SDF Transport Layer; CAKE; IFIS or IFOP; off-chip.]
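To compare the DRAM latencies quoted in nanoseconds with the cache latencies quoted in core clocks, the two can be put on one scale by multiplying by the core frequency. A small sketch; the helper name is hypothetical, and the 3.7 GHz figure used in the note below is only an assumed example clock, not the clock of the Threadripper system that produced the RP2-22 measurements.

```cpp
#include <cmath>

// Converts a latency in nanoseconds to core clock cycles for a given
// core frequency in GHz (cycles = ns * GHz), rounded to the nearest cycle.
inline long latencyInCycles(double nanoseconds, double coreClockGHz) {
    return std::lround(nanoseconds * coreClockGHz);
}
```

At an assumed 3.7 GHz, 64 ns of near-memory latency is roughly 237 core clocks and 105 ns of far-memory latency is close to 390, compared with 35 clocks for an L3 hit on the Ryzen 7 2700X.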
AMD UPROF PROFILER

V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Multiple counters using the same event but different unit masks are supported
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors).
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.

Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling. Events collected (PTI = per thousand instructions):
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
[Figure: counter domains grouped by access method. Labels: rdpmc; SMN; in/out.]
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing.
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default.
  • Set core clock > base clock to disable boost on "Raven Ridge" processors.
• Note: the SMU may still reduce frequency if the application exceeds power, current, or thermal limits.
OPTIMIZATIONS AND LESSONS LEARNED

AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE

USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors.

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
COMPILE FOR PLATFORM X64
• The binary compiled for the x64 platform shows higher performance.
  • The ReleaseSSE2 configuration uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes.

LAME 3.100 WAV to MP3 Encode, Elapsed Time (s), less is better
Configuration | Platform | seconds | diff
Release | x86 | 52 |
ReleaseSSE2 | x86 | 44 | +18%
Release | x64 | 43 | +21%
ReleaseSSE2 | x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

Debugger output for an unsupported instruction:
0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
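A minimal sketch of testing CPUID before taking an SSSE3 code path (pabsd in the crash above is an SSSE3 instruction). SSSE3 is reported in CPUID function 1, ECX bit 9. The helper names are hypothetical; the sketch uses the `__cpuid` intrinsic on MSVC and `__get_cpuid` from `<cpuid.h>` on GCC/Clang, and reports no features on non-x86 builds.

```cpp
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SKETCH_CPUID_MSVC 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__i386__) || defined(__x86_64__))
#include <cpuid.h>
#define SKETCH_CPUID_GNU 1
#endif

// CPUID function 1, ECX feature bits.
constexpr std::uint32_t kEcxSsse3 = 1u << 9;
constexpr std::uint32_t kEcxSse41 = 1u << 19;

constexpr bool hasFeature(std::uint32_t reg, std::uint32_t bit) {
    return (reg & bit) != 0;
}

// Reads CPUID function 1 ECX. Returns 0 when CPUID is unavailable in this
// build, so every feature test fails safe and the baseline path is used.
inline std::uint32_t cpuidLeaf1Ecx() {
#if defined(SKETCH_CPUID_MSVC)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    return static_cast<std::uint32_t>(regs[2]);
#elif defined(SKETCH_CPUID_GNU)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) {
        return 0;
    }
    return ecx;
#else
    return 0;
#endif
}

inline bool cpuSupportsSsse3() {
    return hasFeature(cpuidLeaf1Ecx(), kEcxSsse3);
}
```

A dispatcher would call `cpuSupportsSsse3()` once at startup and select the SSSE3 or baseline implementation accordingly. AVX-class features need more than a CPUID bit: the OS must also enable the register state, which is checked through OSXSAVE and XGETBV, and that step is not shown here.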
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits, and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
• One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD "Bulldozer" is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

// This advice is specific to AMD processors and is not
// general guidance for all processor vendors.
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        } else {
            count = cores;
        }
    }
    return count;
}
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • "Number of cores per processor" & "Number of threads per processor core" return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores
as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions.
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

DWORD lps = 64;                     // logical processors
DWORD_PTR p = 0xffffffffffffffff;   // process mask
int b = 32;                         // bit index

#if 0 // BAD
int x = p;                          // signed narrowing
int count = 0;
// while (x != 0) {                 // never zero: loops forever
while (x > 0) {                     // false: loop body never runs
    count += (x & 1);
    x >>= 1;                        // fill bits with sign
}
printf("popcnt = %i\n", count);     // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined, 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
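The same idea can be written as a portable sketch that keeps the affinity mask in an unsigned 64-bit type from end to end. The function names are hypothetical, and `std::uint64_t` stands in for `DWORD_PTR`, which is an assumption that holds for 64-bit Windows processes.

```cpp
#include <cstdint>

// Counts the logical processors set in a 64-bit affinity mask.
// The mask stays unsigned, so right shifts fill with zeros and the
// loop always terminates.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Builds the mask bit for one logical processor index in [0, 63].
// Shifting a 64-bit unsigned one avoids the undefined behavior of
// shifting a 32-bit signed literal by 32 or more bits.
inline std::uint64_t processorBit(unsigned index) {
    return std::uint64_t{1} << index;
}
```

With a full process mask, `countProcessors` returns 64 and `processorBit(32)` yields 0x0000000100000000, matching the values noted in the GOOD branch of the slide code.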
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg commands:
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"

// char buffer[0x1000];  /* bad assumption */
char* buffer = NULL;
DWORD len = 0;
// First call with a NULL buffer: fails and reports the required length.
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        // Second call with a buffer of the reported size.
        if (GetLogicalProcessorInformationEx(
                RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
            // ...
        }
    }
}
free(buffer);
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• The binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Figure: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts, 0 to 16. Y axis: CPU time difference from NumContexts 1. Series: Draws 1025, Draws 10250.]
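Reading the slide's recommendation as NumContexts = min(cores - 1, Draws / 250), the heuristic can be sketched as a small function. The function name is hypothetical, and clamping the result to at least one context is an added assumption so that small scenes or single-core systems still get one recording context.

```cpp
#include <algorithm>

// Suggested number of command-list recording contexts:
//   min(physical cores - 1, draws / 250), clamped to at least 1.
// The lower clamp is an assumption and is not stated on the slide.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = physicalCores > 1 ? physicalCores - 1 : 1;
    const unsigned byDraws = draws / 250;  // integer division
    return std::max(1u, std::min(byCores, byDraws));
}
```

On an 8-core processor this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws, the two draw counts shown in the chart.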
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin lock best practices
  • Avoid lock prefix instructions.
  • Use the pause instruction.
  • Test and test-and-set.
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K per thousand instructions is bad for top functions.

namespace MyLock {
typedef unsigned LOCK, *PLOCK;
enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

#if 0 // BAD
void Lock(PLOCK pl) {
    while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
               pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
    }
}
#else // GOOD
void Lock(PLOCK pl) {
    while ((LOCK_IS_TAKEN == *pl) ||
           (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
        _mm_pause();
    }
}
#endif

void Unlock(PLOCK pl) {
    _InterlockedExchange(pl, LOCK_IS_FREE);
}
} // namespace MyLock

alignas(64) MyLock::LOCK gLock;
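For codebases that prefer standard C++ over compiler intrinsics, the same best practices (test and test-and-set, pause while spinning, a 64-byte aligned lock variable) can be sketched with `std::atomic`. This is an illustrative alternative and not the code from the talk; the `SpinLock` type and `SPIN_PAUSE` macro are hypothetical names, and the yield fallback on non-x86 builds is an assumption.

```cpp
#include <atomic>
#include <thread>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define SPIN_PAUSE() _mm_pause()
#elif defined(__i386__) || defined(__x86_64__)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()  // fallback for non-x86 builds
#endif

// Test and test-and-set spinlock that occupies its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<bool> taken{false};

    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiters do not issue locked
            // instructions while the lock is held.
            while (taken.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: attempt the locked exchange only once the
            // lock looks free.
            if (!taken.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    bool try_lock() {
        return !taken.load(std::memory_order_relaxed) &&
               !taken.exchange(true, std::memory_order_acquire);
    }

    void unlock() { taken.store(false, std::memory_order_release); }
};
```

Because the type provides `lock`, `try_lock`, and `unlock`, it works with `std::lock_guard<SpinLock>`. One difference from the slide code: `unlock` here is a release store rather than an interlocked exchange, which on x86 avoids one more locked instruction.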
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks; elapsed time (s) per binary, less is better]
binary | Before | After
Time (s) | 304 | 27
diff | | +1018%
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <windows.h>
#include <algorithm>  // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        // LEN independent 4x4 matrix multiply-accumulates
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
  cmp dword ptr [?gLock@@3IA],1
  je 00000001400010D9
  mov eax,1
  xchg eax,dword ptr [?gLock@@3IA]
  cmp eax,1
  jne 00000001400010DD
00000001400010D9:
  pause
  jmp 00000001400010C0
00000001400010DD:
BEFORE (BAD)
00000001400010C1:
  xor eax,eax
  lock cmpxchg dword ptr [?gLock@@3IA],ecx
  cmp eax,ecx
  je 00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs only under very specific circumstances: length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
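To check whether a given call site could fall in the reported range, the length condition from the preface can be written as a small predicate. This is an illustrative sketch only: the function name is hypothetical and is not part of the C runtime or of any AMD tool, and it covers only the length test, not the "unknown at compile time" part.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when a run-time copy/set length lies in the
// range quoted in the preface, 32 < length <= 128 bytes.
constexpr bool in_reported_regression_range(std::size_t length) {
    return length > 32 && length <= 128;
}
```

A profiling harness could, for example, count how many memcpy calls in a hot path satisfy this predicate before deciding whether the workaround is worth applying.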
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder
; memcpy.asm line 456
XmmCopySmall:
    bt __favor, __FAVOR_SMSTRG    ; check if string copy should be used
    jc memcpy_repmovs
    movups xmm0, [rcx + rdx]      ; load deferred bytes
    add rcx, 16
    sub r8, 16
; memset.asm line 93
    ; Check if strings should be used
    cmp r8, 128                   ; is this a small set, size <= 128?
    ja XmmSet                     ; if large set, use XMM set
    bt __favor, __FAVOR_SMSTRG    ; check if string set should be used
    jnc XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better); elapsed time (s) versus size (0 to 160 bytes); series before and after]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, rand, srand
#include <cstring>  // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);  // size is only known at run time
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (int i = 0; i < sizeof(a); ++i)  // i < sizeof(a) keeps the write inside the array
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
memcpy:
  mov r11,rcx
  mov r10,rdx
  cmp r8,10h
  jbe 00000001400012F0
  cmp r8,20h
  jbe 00000001400012D0
  sub rdx,rcx
  jae 00000001400012A4
  lea rax,[r8+r10]
  cmp rcx,rax
  jb 00000001400015D0
  cmp r8,80h
  jbe 0000000140001510
  …
BEFORE (WITHOUT WORKAROUND): calls memcpy in vcruntime140.dll
memcpy:
  jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams; elapsed time (s) per binary, less is better]
binary | Before | After
Time (s) | 104 | 47
diff | | +123%
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atof, atoll, rand, srand
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // layout per element: x, y, z, w, vx, vy, vz, vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);  // first non-temporal stream
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+4640h]
  movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+4640h]
  movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing; time (s) per binary]
binary | Before | After
Time (s) | 365 | 47
diff | | +679%
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, rand, srand
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: adjacent ThreadData objects share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: each ThreadData owns a cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif
DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
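The effect of the alignas(64) change can be checked without running the benchmark. The sketch below is illustrative only: PackedCounter, PaddedCounter and share_cache_line are hypothetical names, and the 64-byte line size is taken from the slides rather than queried from the hardware.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PackedCounter { unsigned long sum; };              // neighbours share a line
struct alignas(64) PaddedCounter { unsigned long sum; };  // one line per object

// Hypothetical helper: do two objects start in the same 64-byte cache line?
template <typename T>
bool share_cache_line(const T& a, const T& b) {
    const auto line_a = reinterpret_cast<std::uintptr_t>(&a) / 64;
    const auto line_b = reinterpret_cast<std::uintptr_t>(&b) / 64;
    return line_a == line_b;
}
```

With the padded type, consecutive array elements always land on different lines, so threads that each update their own element no longer invalidate each other's copy. Note that honoring the 64-byte alignment for arrays allocated with new relies on C++17 aligned allocation.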
PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
  mov qword ptr [rsp+28h],rsi
  lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov r9,rbp
  mov dword ptr [rsp+20h],esi
  xor edx,edx
  xor ecx,ecx
  call qword ptr [__imp_CreateThread]
  mov qword ptr [r15+rdi*8],rax
  add rbp,40h
  inc rdi
  cmp rdi,r14
  jb 0000000140001180
BEFORE (BAD)
0000000140001180:
  mov qword ptr [rsp+28h],rsi
  lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov r9,rbp
  mov dword ptr [rsp+20h],esi
  xor edx,edx
  xor ecx,ecx
  call qword ptr [__imp_CreateThread]
  mov qword ptr [r12+rdi*8],rax
  add rbp,4
  inc rdi
  cmp rdi,r14
  jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floydd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
FLOAT INSTRUCTIONS
• There can be a significant penalty for mixing SSE and AVX instructions when the upper 128 bits of the YMM registers contain non-zero data.
  • Avoid mixing legacy SIMD and AVX instructions.
  • Zero the upper 128 bits of all YMM registers before executing any legacy SIMD instructions by using the VZEROUPPER or VZEROALL instruction.
  • Zero the upper 128 bits of all YMM registers after leaving a legacy SIMD section of code by using the VZEROUPPER or VZEROALL instruction.
• Results from x86 processors from different vendors may not match exactly for instructions RCPPS, RCPSS, RSQRTPS, RSQRTSS, which define relative error as:
  • |Relative Error| <= 1.5 * 2^-12
[Diagram: floating-point unit. Decode/Rename, 192 Entry Retire Queue, 36 Entry Scheduler, 160 entry Physical Register File, MUL0, ADD0, MUL1, ADD1, Load Convert Unit, Forwarding Muxes, Int to FP, Fp to Int; 4 fp micro-op dispatch, 8 micro-op retire, 128 bit loads]
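The relative-error bound for the reciprocal approximation instructions is 1.5 * 2^-12, which is why bit-exact comparisons across vendors can fail even when both results are valid. A minimal sketch of a tolerance check follows; the helper names are hypothetical and the check works on plain doubles rather than on the instructions' actual outputs.

```cpp
#include <cassert>
#include <cmath>

// 1.5 * 2^-12, the documented bound on |relative error|.
inline double rcp_relative_error_bound() { return 1.5 * std::ldexp(1.0, -12); }

// Hypothetical helper: is an approximate reciprocal of x within the bound?
inline bool within_rcp_bound(double x, double approx) {
    const double exact = 1.0 / x;
    return std::fabs((approx - exact) / exact) <= rcp_relative_error_bound();
}
```

Tests that compare results produced on different processors would use a tolerance of this kind instead of exact equality.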
CLZERO INSTRUCTION
• AMD vendor-specific instruction
• An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process, but keep the system running
• Execution path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory
[Diagram: load/store unit. Load Queue, Store Queue, L1/L2 TLB, Micro-tags, Page Walker, L0 Pick, L1 Pick, TLB0, DAT0, TLB1, DAT1, 32K Data Cache, Prefetch, 32 bytes to/from L2, AGU0, AGU1, To Ex, To FP, WCB, To L2, MAB, Store Pipe Pick, STP, Store Commit]
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • Lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
CACHE LATENCY
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | N/A
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
Cache Latency in core clock cycles (less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
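The Sets column of the cache table can be cross-checked from the other columns, since for a set-associative cache sets = capacity / (ways x line size). The following is a minimal sketch of that arithmetic; `cache_sets`, `KB`, and `MB` are hypothetical names chosen for illustration and do not come from the slides.

```cpp
#include <cstddef>

// Hypothetical helper (not from the slides): number of sets in a
// set-associative cache, computed as capacity / (ways * line size).
constexpr std::size_t cache_sets(std::size_t capacity_bytes,
                                 std::size_t ways,
                                 std::size_t line_bytes) {
    return capacity_bytes / (ways * line_bytes);
}

constexpr std::size_t KB = 1024;
constexpr std::size_t MB = 1024 * KB;

// Capacity, ways, and line size are taken from the cache table;
// each result matches the table's Sets column.
static_assert(cache_sets(64 * KB, 4, 64) == 256, "L1I");
static_assert(cache_sets(32 * KB, 8, 64) == 64, "L1D");
static_assert(cache_sets(512 * KB, 8, 64) == 1024, "L2U");
static_assert(cache_sets(8 * MB, 16, 64) == 8192, "L3U");
```

The uop cache row is left out because its capacity and line size are given in micro-ops rather than bytes.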
• ~64ns near memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM LOCAL DRAM
[Diagram: blocks labeled DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, and IFIS or IFOP (off-chip)]
• The cost of a refill from the other CCX may be similar to memory latency
REFILL FROM OTHER LOCAL CCX
[Diagram: blocks labeled DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, and IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
[Diagram: two dies, each with blocks labeled DDR4, CCX, CCM, CS, UMC, CAKE, and SDF Transport Layer, with IFIS or IFOP going off-chip]
• ~105ns far memory (see footnote RP2-22 above for the test configuration)
AMD μPROF PROFILER
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
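Several of the metrics listed for this profile are ratios of raw counter values: IPC is retired instructions divided by CPU clocks, and a PTI metric is an event count scaled to one thousand retired instructions. The sketch below shows that arithmetic only; the function names are hypothetical, and how the raw counts are collected (for example by AMD uProf) is outside its scope.

```cpp
#include <cstdint>

// Hypothetical helper: events Per Thousand Instructions (PTI).
// Returns 0 when no instructions were retired to avoid dividing by zero.
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    if (retired_instructions == 0) return 0.0;
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}

// Hypothetical helper: Instructions Per Cycle (IPC).
inline double ipc(std::uint64_t retired_instructions,
                  std::uint64_t cpu_clocks) {
    if (cpu_clocks == 0) return 0.0;
    return static_cast<double>(retired_instructions) /
           static_cast<double>(cpu_clocks);
}
```

For instance, 50 events over 10,000 retired instructions is 5 PTI, and 3,000 instructions over 2,000 clocks is an IPC of 1.5.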
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
[Diagram labels: rdpmc, SMN, in/out]
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops. Support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
• Binary compiled after applying platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORMX64
LAME 3.100 WAV to MP3 Encode (less is better)
Configuration, Platform | Elapsed Time (s) | Diff
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

    0:000> g
    (79c.838): Illegal instruction - code c000001d (first chance)
    (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb630:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
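As an illustration of testing CPUID before using an instruction set, the sketch below decodes the feature bits relevant to this slide from register values that the caller has already read with the CPUID instruction. The `CpuidRegs` struct and the function names are hypothetical; obtaining the registers (for example with the `__cpuid` and `__cpuidex` intrinsics in MSVC) is left out so the decoding logic stays portable.

```cpp
#include <cstdint>

// Hypothetical container for the four registers returned by CPUID.
// The caller is assumed to have filled it for the leaf named in each helper.
struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

// Returns true if bit `index` of `value` is set.
constexpr bool bit(std::uint32_t value, unsigned index) {
    return ((value >> index) & 1u) != 0;
}

// CPUID leaf 0x00000001: ECX bit 9 reports SSSE3 (pabsd is an SSSE3 instruction).
constexpr bool has_ssse3(const CpuidRegs& leaf1) {
    return bit(leaf1.ecx, 9);
}

// CPUID leaf 0x00000007, subleaf 0: EBX bit 16 reports AVX512F.
constexpr bool has_avx512f(const CpuidRegs& leaf7) {
    return bit(leaf7.ebx, 16);
}

// CPUID leaf 0x80000001: ECX bit 16 reports FMA4.
constexpr bool has_fma4(const CpuidRegs& ext_leaf1) {
    return bit(ext_leaf1.ecx, 16);
}
```

A complete check would also confirm that the queried leaf is within the maximum supported leaf and, for AVX-family instructions, that the operating system has enabled the register state (reported through XGETBV). Code paths that use pabsd, AVX512, or FMA4 would then be selected only when the matching helper returns true.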
USE BEST PRACTICES WHEN COUNTING CORES
    // This advice is specific to AMD processors and is not general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
Incorrect enum.exe output for Ryzen™ 7 2700X:

    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:

    Number of cores per processor: 8
    Number of threads per processor core: 2

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
    DWORD lps = 64;                    // logical processors
    DWORD_PTR p = 0xffffffffffffffff;  // process mask
    int b = 32;                        // bit index

    #if 0 // BAD
    int x = p;                         // signed narrowing
    int count = 0;
    // while (x != 0)                  // never zero
    while (x > 0) {                    // false
        count += (x & 1);
        x >>= 1;                       // fill bits with sign
    }
    printf("popcnt = %i\n", count);    // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined, 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif

AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: for signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
    // char buffer[0x1000];  // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // First call with no buffer: fails and reports the required length in len.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            // Second call with a buffer of the reported size.
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }

GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
BUILD COMMAND LISTS IN PARALLEL
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts, 0 to 16. Y axis: cpu time difference from NumContexts 1, -50 to 400. Series: Draws 1025, Draws 10250]
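The NumContexts recommendation on this slide can be sketched as a small helper. This assumes the garbled formula reads min(cores - 1, Draws / 250); the function name is hypothetical, and the lower clamp to one context is an added assumption so that small draw counts or single-core systems still get a worker.

```cpp
#include <algorithm>

// Hypothetical helper for the slide's recommendation, read here as
// NumContexts = min(cores - 1, Draws / 250).
// Assumption not in the slides: the result is clamped to at least 1.
inline int recommended_num_contexts(int physical_cores, int draws) {
    const int by_cores = physical_cores - 1;  // leave one core free
    const int by_draws = draws / 250;         // integer division
    return std::max(1, std::min(by_cores, by_draws));
}
```

With the 8-core Ryzen 7 2700X used for the chart, this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws. Profiling the actual workload is still the way to confirm the best value.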
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            // Every spin iteration issues a locked read-modify-write.
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {  // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test with a plain read first, then test-and-set with xchg.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif

        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
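For readers not using the MSVC intrinsics, the same test and test-and-set idea can be sketched with `std::atomic`. This is an illustrative variant, not the code from the slides: the `SpinLock` class and `SPIN_PAUSE` macro are hypothetical names, and unlike the original `Unlock`, which uses an exchange, this sketch releases the lock with a plain release store.

```cpp
#include <atomic>

// Use the pause instruction on x86; fall back to a no-op elsewhere.
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() ((void)0)
#endif

// Hypothetical test and test-and-set spin lock, padded to its own
// 64-byte cache line so it does not share a line with other data.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock looks taken.
            while (taken_.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: one atomic exchange once the lock looks free.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    // Single attempt; returns true if the lock was acquired.
    bool try_lock() {
        return !taken_.load(std::memory_order_relaxed) &&
               !taken_.exchange(true, std::memory_order_acquire);
    }

    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> taken_{false};
};

static_assert(alignof(SpinLock) == 64, "lock should own a cache line");
```

Because the class provides `lock`, `try_lock`, and `unlock`, it can be used with `std::lock_guard<SpinLock>`. Which instructions the compiler emits for the exchange and the store depends on the compiler and target, so the disassembly is worth checking against the guidance above.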
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Use Best Practices With Spinlocks (less is better)
binary | Elapsed Time (s) | Diff
Before | 304 |
After | 27 | +1018%
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    // Each thread takes the lock, then multiplies LEN 4x4 matrices repeatedly.
    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)

    00000001400010C0:
        cmp  dword ptr [?gLock@@3IA],1
        je   00000001400010D9
        mov  eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp  eax,1
        jne  00000001400010DD
    00000001400010D9:
        pause
        jmp  00000001400010C0
    00000001400010DD:

BEFORE (BAD)

    00000001400010C1:
        xor  eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp  eax,ecx
        je   00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs only under very specific circumstances: length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456:
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG  ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]        ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

memset.asm line 93:
    ; Check if strings should be used
    cmp     r8, 128                  ; is this a small set, size <= 128?
    ja      XmmSet                   ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG  ; check if string set should be used
    jnc     XmmSetSmall              ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X axis: size, 0 to 160. Y axis: Elapsed Time (s), 0 to 35. Series: before, after.]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, rand, srand, EXIT_SUCCESS
#include <cstring>  // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);  // size is not known at compile time
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (int i = 0; i < sizeof(a); ++i)  // was <=, which wrote one byte past the end of a
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND): calls memcpy in vcruntime140.dll
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%).]
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, atof, rand, srand, EXIT_SUCCESS
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0
        // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else
        // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid False Sharing. Time (s) by binary: Before 365, After 47 (+679%).]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, rand, srand, EXIT_SUCCESS
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0
// 4 bytes
struct ThreadData { unsigned long sum; };
#else
// 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
[Profiler columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX
[Profiler columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI]

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180

BEFORE (BAD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
Average FPS by run: 2017 The International (DX9, DX11, Vulkan) and 2015 The Frankfurt Major (DX9, DX11, Vulkan)
run | DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan
1 | 48.6 | 54.0 | 53.2 | 52.1 | 58.9 | 59.0
2 | 51.8 | 54.3 | 54.7 | 53.8 | 58.8 | 60.2
3 | 51.7 | 54.5 | 54.9 | 53.9 | 58.8 | 60.0
4 | 51.8 | 54.4 | 55.1 | 53.9 | 58.9 | 60.2
average last 3 | 51.8 | 54.4 | 54.9 | 53.9 | 58.9 | 60.1
diff vs DX9 (%) | | 5 | 6 | | 9 | 12
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250
1 | 0.2016 | 1.5024 | 1 | 0 | 0
2 | 0.148 | 0.7923 | 2 | 36 | 90
3 | 0.1336 | 0.5653 | 3 | 51 | 166
4 | 0.1278 | 0.4625 | 4 | 58 | 225
5 | 0.1387 | 0.4025 | 5 | 45 | 273
6 | 0.1499 | 0.3674 | 6 | 34 | 309
7 | 0.17 | 0.3429 | 7 | 19 | 338
8 | 0.1867 | 0.3432 | 8 | 8 | 338
9 | 0.1859 | 0.3947 | 9 | 8 | 281
10 | 0.1947 | 0.3854 | 10 | 4 | 290
11 | 0.2083 | 0.3754 | 11 | -3 | 300
12 | 0.2173 | 0.3788 | 12 | -7 | 297
13 | 0.2241 | 0.3787 | 13 | -10 | 297
14 | 0.2533 | 0.3783 | 14 | -20 | 297
15 | 0.2526 | 0.3800 | 15 | -20 | 295
16 | 0.2652 | 0.3721 | 16 | -24 | 304
before | after | size | before (rounded) | after (rounded) | diff
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64
Average | 52 | 44 | 43 | 39
1 | 52 | 44 | 43 | 39
2 | 52 | 44 | 43 | 39
3 | 52 | 44 | 43 | 39
4 | 52 | 44 | 43 | 39
diff (%) | | 18 | 21 | 33
label | | +18% | +21% | +33%
Input: 1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021)
Avoid False Sharing: run times
before (ms) | after (ms) | binary | seconds | difference
387827.0 | 46866.7 | Before | 365 |
357936.9 | 46861.3 | After | 47 | +679%
361779.7 | 46860.6
358574.6 | 46874.0
405331.1 | 46879.8
362097.3 | 46866.6
351227.9 | 46883.5
361547.6 | 46878.6
350793.7 | 46874.0
356234.1 | 46873.0
Avoid Too Many Non-Temporal Streams: run times
before (ms) | after (ms) | binary | seconds | difference
100761.0284 | 47658.2356 | Before | 104 |
102424.7928 | 44804.5378 | After | 47 | +123%
105214.5299 | 46665.9369
105673.5958 | 46490.7149
102436.8829 | 47405.3478
106663.3156 | 46167.4963
105820.3029 | 46251.4829
104111.8097 | 49289.895
105539.2649 | 47589.8095
105117.2582 | 46176.0799
Use Best Practices With Spinlocks: run times
before (ms) | after (ms) | binary | seconds | difference
360865.1906 | 26743.1402 | Before | 304 |
224205.3937 | 27370.7055 | After | 27 | +1018%
304288.6884 | 27174.6015
260937.2553 | 26995.3529
323743.2947 | 26973.0537
299180.3304 | 27254.37
366555.8054 | 27383.2348
307769.6925 | 27258.0411
237428.8763 | 27185.4875
350497.6333 | 27232.9958
Cache latency (core clock cycles)
Processor | L1D | L2U | L3U
AMD Ryzen 7 1800X | 4 | 17 | 40
AMD Ryzen 7 2700X | 4 | 12 | 35
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)

DOTA 2 version 3359 gameplay 7.21b (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21b, Steam Beta Built Feb 15 2019 at 15:03:34.

Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
CLZERO INSTRUCTION
• AMD Vendor specific instruction
  • An atomic MOVNT streaming store type instruction that writes an entire 64B cache line full of zeros to memory and clears poisoned status
• Intended to recover from some otherwise fatal Machine Check Architecture (MCA) errors caused by uncorrectable corrupt memory
  • Example: kill a corrupt user process but keep the system running
• Execution Path: Store Queue, Store Commit, Write Combining Buffer, L2, Data Fabric, Memory
• Is non-cacheable, unlike the PowerPC DCBZ instruction
• Use memset rather than the CLZERO intrinsic to quickly zero memory

[Diagram: load/store unit. Labels: Load Queue, Store Queue, L1/L2 TLB + Micro-tags, Page Walker, L0 Pick, L1 Pick, TLB0, DAT0, TLB1, DAT1, 32K Data Cache, Prefetch, 32 bytes to/from L2, AGU0, AGU1, To Ex, To FP, WCB, To L2, MAB, Store Pipe Pick, STP, Store Commit]
CACHE LATENCY
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions

Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks

[Chart: Cache Latency (less is better), in core clock cycles]
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201921
bull ~64ns near memory bull RP2-22 DRAM latency for Die0 or Die2 communicating with their respective local
memory pool(s) approximately 64ns with DDR4-3200 DRAM Latency for Die0 or Die2 communicating with the other diersquos memory pool approximately 105ns with DDR4-3200 Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps AMD System configuration AMD Ryzentrade Threadrippertrade 2990WX and 1950X Corsair H100i CLC 4x8GB DDR4-3200 (16-18-18) Asus Zenith X399 Extreme (BIOS 0008) GeForce GTX 1080 Ti (driver 39836) Windowsreg 10 x64 1803 Samsung 850 Pro SSD Western Digital Black 2TB HDD Results may vary with configuration RP2-22
bull CCX Core Complexbull CCM Cache-Coherent Masterbull SDF Scalable Data Fabricbull CAKE Coherent AMD socKet Extenderbull IFIS Infinity Fabric Inter-Socket SerDesbull IFOP Infinity Fabric On-Package SerDesbull CS Coherent Slavebull UMC Unified Memory Controller
REFILL FROM LOCAL DRAM
DDR4 DDR4
CCX CCX
IFIS or IFOP
off-chip
CCM CCM
CS CS
UMC UMC
CAKESDF Transport Layer
REFILL FROM OTHER LOCAL CCX
• Refill from other CCX cost may be similar to memory latency

[Diagram: DDR4, UMC, CS, CCM, CCX, SDF Transport Layer, CAKE, IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (RP2-22)

[Diagram: two dies, each with DDR4, UMC, CS, CCM, CCX, SDF Transport Layer, CAKE, connected by IFIS or IFOP (off-chip)]
AMD UPROF PROFILER

V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support Multiple counters using the same event but different unit masks supported
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201926
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p High, novsync.

Threads   Time (s)
0         23
1         29
2         6
3         3
4         4
5         4
6         4
7         3
8         2
9         2
10        1
11        1
12        1
13        1
14        1
15        5
16        11
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access: rdpmc, SMN in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
• Binary compiled for the x64 platform shows higher performance
• The ReleaseSSE2 configuration uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORM X64
[Chart: LAME 3.100 WAV to MP3 Encode, elapsed time in seconds by configuration and platform (less is better)]

Configuration, Platform | Elapsed Time (s) | Difference
Release, x86            | 52               |
ReleaseSSE2, x86        | 44               | +18%
Release, x64            | 43               | +21%
ReleaseSSE2, x64        | 39               | +33%
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

0:000> g
(79c8.38): Illegal instruction - code c000001d (first chance)
(79c8.38): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foo::bar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
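The checks this slide asks for can be sketched as a few CPUID queries. This is a minimal illustration, not code from the talk: the helper names (`cpuidLeaf`, `hasSSE2`, `hasSSSE3`, `hasFMA4`, `hasAVX512F`) are hypothetical, and the bit positions are the ones documented for CPUID leaf 1, leaf 7 and extended leaf 0x80000001. Using AVX or AVX-512 safely also requires checking operating system support through XGETBV, which this sketch leaves out.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Hypothetical helper: runs CPUID for one leaf/subleaf.
// Returns false when the processor does not report that leaf.
static bool cpuidLeaf(uint32_t leaf, uint32_t subleaf, uint32_t regs[4]) {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, static_cast<int>(leaf & 0x80000000u));  // highest supported leaf
    if (static_cast<uint32_t>(info[0]) < leaf) return false;
    __cpuidex(info, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<uint32_t>(info[i]);
    return true;
#else
    return __get_cpuid_count(leaf, subleaf,
                             &regs[0], &regs[1], &regs[2], &regs[3]) != 0;
#endif
}

// regs[] order is EAX, EBX, ECX, EDX.
static bool hasSSE2() {     // leaf 1, EDX bit 26
    uint32_t r[4];
    return cpuidLeaf(1, 0, r) && ((r[3] >> 26) & 1u);
}

static bool hasSSSE3() {    // leaf 1, ECX bit 9
    uint32_t r[4];
    return cpuidLeaf(1, 0, r) && ((r[2] >> 9) & 1u);
}

static bool hasFMA4() {     // leaf 0x80000001, ECX bit 16
    uint32_t r[4];
    return cpuidLeaf(0x80000001u, 0, r) && ((r[2] >> 16) & 1u);
}

static bool hasAVX512F() {  // leaf 7 subleaf 0, EBX bit 16
    uint32_t r[4];
    return cpuidLeaf(7, 0, r) && ((r[1] >> 16) & 1u);
}
```

A dispatcher would call these once at startup and select, for example, an SSSE3 code path only when `hasSSSE3()` returns true, falling back to an SSE2 path otherwise.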
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is
// not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        }
        else {
            count = cores;
        }
    }
    return count;
}
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
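The slide's `getDefaultThreadCount()` relies on helpers that are not shown. One possible shape for the CPUID-based ones is sketched below; these bodies are assumptions written for illustration, not the GPUOpen sample's code. `chooseThreadCount` is a hypothetical pure function that mirrors the slide's decision so it can be exercised without querying the operating system for core counts.

```cpp
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Hypothetical wrapper: fills regs with EAX, EBX, ECX, EDX for one leaf.
static void cpuidRegs(uint32_t leaf, uint32_t regs[4]) {
#if defined(_MSC_VER)
    int info[4];
    __cpuid(info, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<uint32_t>(info[i]);
#else
    __cpuid(leaf, regs[0], regs[1], regs[2], regs[3]);
#endif
}

// Vendor string from leaf 0, stored in EBX, EDX, ECX order.
static void getCpuidVendor(char vendor[13]) {
    uint32_t r[4];
    cpuidRegs(0, r);
    std::memcpy(vendor + 0, &r[1], 4);
    std::memcpy(vendor + 4, &r[3], 4);
    std::memcpy(vendor + 8, &r[2], 4);
    vendor[12] = '\0';
}

// Display family from leaf 1 EAX: base family, plus the extended
// family when the base family is 0xF (0x15 "Bulldozer", 0x17 "Zen").
static uint32_t getCpuidFamily() {
    uint32_t r[4];
    cpuidRegs(1, r);
    const uint32_t base = (r[0] >> 8) & 0xFu;
    const uint32_t ext = (r[0] >> 20) & 0xFFu;
    return (base == 0xFu) ? base + ext : base;
}

// Mirrors the slide's policy: physical cores on AMD, except family 15h.
static uint32_t chooseThreadCount(const char* vendor, uint32_t family,
                                  uint32_t cores, uint32_t logical) {
    if (0 == std::strcmp(vendor, "AuthenticAMD") && family != 0x15u)
        return cores;
    return logical;
}
```

For a Ryzen 7 2700X (family 17h, 8 cores, 16 logical processors) this policy yields 8 threads as a starting point; profiling should still decide the final count.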
Incorrect enum.exe output for Ryzen™ 7 2700X:

Physical Processor ID 0 has 16 cores
as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • "Number of cores per processor" & "Number of threads per processor core" return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64; // logical processors
DWORD_PTR p = 0xffffffffffffffff; // process mask
int b = 32; // bit index

#if 0 // BAD
    int x = p; // signed narrowing
    int count = 0;
    while (x != 0) { // never zero
    //while (x > 0) { // false
        count += (x & 1);
        x >>= 1; // fill bits with sign
    }
    printf("popcnt = %i\n", count); // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b)); // undefined; 0
#else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
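The same rules can be shown in portable form. The sketch below uses `uint64_t` as a stand-in for a 64-bit `DWORD_PTR` affinity mask; `countLogicalProcessors` and `processorBit` are hypothetical names chosen for this example.

```cpp
#include <cstdint>

// Counts the set bits of a 64-bit affinity mask. The mask stays
// unsigned and 64 bits wide, so the right shift fills with zeros
// and the loop always terminates.
static int countLogicalProcessors(uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Builds the mask for one logical processor index (0..63).
// The constant is 64-bit before the shift, so indices of 31 and
// above do not overflow a signed 32-bit int.
static uint64_t processorBit(unsigned index) {
    return uint64_t{1} << index;
}
```

With a full 64-processor group mask the count is 64, and bit index 32 produces 0x0000000100000000, matching the GOOD branch of the slide's code.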
// char buffer[0x1000]; // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll,
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                &len)) {
            // ...
        }
    }
    free(buffer);
}
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length: 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length: 0x%x\\n\",poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
BUILD COMMAND LISTS IN PARALLEL
[Chart: D3D12Multithreading, TitleThrottle=10000, cpu time difference from NumContexts 1 versus NumContexts (0 to 16), series Draws 1025 and Draws 10250 (higher is better)]
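The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a sketch under two assumptions that the slide does not state: the division is integer division, and the result is clamped to at least one context so that small draw counts or single-core systems still get a worker. The name `recommendedNumContexts` is hypothetical.

```cpp
#include <algorithm>

// Suggested number of command-list recording contexts:
// min(cores - 1, draws / 250), clamped to a minimum of 1.
static unsigned recommendedNumContexts(unsigned cores, unsigned draws) {
    const unsigned byCores = (cores > 1) ? cores - 1 : 1;
    const unsigned byDraws = draws / 250;  // integer division (assumed)
    return std::max(1u, std::min(byCores, byDraws));
}
```

On an 8-core processor, 1025 draws give 4 contexts and 10250 draws give 7, so the draw count limits the smaller workload and the core count limits the larger one.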
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
        }
    }
#else // GOOD
    void Lock(PLOCK pl) {
        // test (plain read), then test-and-set (xchg)
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
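For code bases that do not use the MSVC intrinsics directly, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant, not the code from the talk: the `SpinLock` type and its member names are assumptions, and unlocking with a release store is a substitution for the slide's `_InterlockedExchange`.

```cpp
#include <atomic>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <immintrin.h>
#endif

// The lock occupies its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is taken,
            // pausing between reads.
            while (state.load(std::memory_order_relaxed) != 0)
                _mm_pause();
            // Test-and-set: a single exchange once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0)
                return;
        }
    }

    // Single attempt; returns true when the lock was acquired.
    bool try_lock() {
        return state.load(std::memory_order_relaxed) == 0 &&
               state.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() { state.store(0, std::memory_order_release); }
};
```

Because the type provides `lock()` and `unlock()`, it can be used with `std::lock_guard<SpinLock>` when `<mutex>` is included.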
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks, elapsed time in seconds by binary (less is better)]

Binary   Time (s)
Before   304
After    27 (+1018%)
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm> // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}
int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp  dword ptr [?gLock@@3IA],1
    je   00000001400010D9
    mov  eax,1
    xchg eax,dword ptr [?gLock@@3IA]
    cmp  eax,1
    jne  00000001400010DD
00000001400010D9:
    pause
    jmp  00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor  eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp  eax,ecx
    je   00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic                          | generates instructions including
_InterlockedAnd                    | lock cmpxchg
_interlockedbittestandreset        | lock btr
_interlockedbittestandset          | lock bts
_InterlockedCompareExchange        | lock cmpxchg
_InterlockedCompareExchange128     | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement              | lock dec
_InterlockedExchangeAdd            | lock add
_InterlockedIncrement              | lock inc
_InterlockedOr                     | lock cmpxchg
_InterlockedXor                    | lock cmpxchg
_InterlockedExchange               | xchg
_InterlockedExchangePointer        | xchg

Preferred: instructions without lock prefix (the xchg rows)
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
  • Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
PREFACE
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder
WORKAROUND

memcpy.asm line 456
XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark, elapsed time in seconds versus size (0 to 160), series before and after (less is better)]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>  // printf
#include <cstdlib> // atoi, atoll, srand, rand, EXIT_SUCCESS
#include <cstring> // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size); // size is unknown at compile time
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}
int main(int argc char argv[]) int j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 7size_t steps = (argc gt 3) atoll(argv[3]) 1000000000int size = (argc gt 4) atoi(argv[4]) 48srand(seed)for (int i = 0 i lt= sizeof(a) ++i )
a[i] = rand()256memset(b 0 sizeof(b))
work(size steps)
printf(a[i] = in j a[j])printf(b[i] = in j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
memcpy:
    mov r11,rcx
    mov r10,rdx
    cmp r8,10h
    jbe 00000001400012F0
    cmp r8,20h
    jbe 00000001400012D0
    sub rdx,rcx
    jae 00000001400012A4
    lea rax,[r8+r10]
    cmp rcx,rax
    jb 00000001400015D0
    cmp r8,80h
    jbe 0000000140001510
    …
BEFORE (WITHOUT WORKAROUND): calls memcpy in vcruntime140.dll
memcpy:
    jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed time (s) by binary: Before 104, After 47 (+123%).]
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
[Figure: profiler source view before the workaround]
SOURCE AFTER
[Figure: profiler source view after the workaround]
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. Time (s) by binary: Before 36.5, After 4.7 (+679%).]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <cstdio>
#include <cstdlib>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: neighbouring ThreadData share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: one ThreadData per cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
[Figure: profiler table. Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Figure: profiler table. Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
[Figure: profiler source view before applying best practices]
SOURCE AFTER
[Figure: profiler source view after applying best practices]
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov qword ptr [rsp+28h],rsi
    lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov r9,rbp
    mov dword ptr [rsp+20h],esi
    xor edx,edx
    xor ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov qword ptr [r15+rdi*8],rax
    add rbp,40h
    inc rdi
    cmp rdi,r14
    jb 0000000140001180
BEFORE (BAD)
0000000140001180:
    mov qword ptr [rsp+28h],rsi
    lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov r9,rbp
    mov dword ptr [rsp+20h],esi
    xor edx,edx
    xor ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov qword ptr [r12+rdi*8],rax
    add rbp,4
    inc rdi
    cmp rdi,r14
    jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
• Cache line size is 64 Bytes
  • 2 cpu clock cycles to move a single cache line
• L2 is inclusive of L1
  • lines filled into L1 are also filled into L2
• L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• L1 capacity evictions may cause L2 capacity evictions and L3 capacity evictions
CACHE LATENCY
Level | Count | Capacity | Sets | Ways | Line Size | Latency
uop | 8 | 2 K uops | 32 | 8 | 8 uops | NA
L1I | 8 | 64 KB | 256 | 4 | 64 B | 4 clocks
L1D | 8 | 32 KB | 64 | 8 | 64 B | 4 clocks
L2U | 8 | 512 KB | 1024 | 8 | 64 B | 12 clocks
L3U | 2 | 8 MB | 8192 | 16 | 64 B | 35 clocks
Cache Latency (core clock cycles, less is better)
Processor | L1D | L2U | L3U
AMD Ryzen™ 7 1800X | 4 | 17 | 40
AMD Ryzen™ 7 2700X | 4 | 12 | 35
REFILL FROM LOCAL DRAM
• ~64ns near memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD System configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
[Diagram: two CCX blocks, each with a CCM, connected through the SDF Transport Layer to CS, UMC, and DDR4; a CAKE links to IFIS or IFOP (off-chip)]
REFILL FROM OTHER LOCAL CCX
• Refill from other CCX cost may be similar to memory latency
[Diagram: two CCX blocks, each with a CCM, connected through the SDF Transport Layer to CS, UMC, and DDR4; a CAKE links to IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
• ~105ns far memory (see footnote RP2-22)
[Diagram: two dies, each with two CCX blocks, CCM, SDF Transport Layer, CS, UMC, and DDR4, linked through CAKE and IFIS or IFOP (off-chip)]
AMD UPROF PROFILER
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Multiple counters using the same event but different unit masks are supported
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
026113948368469037123310014465816445356027464545948084244744703169258286039647620196002881023470681131038921574766493464012699482352719539061254883162357075675349339497049681729493817569961279952311078789537335512378553045777355179055648608259552403667273093468528977000546868143282341100588507939456773521350605410391360521146184154327316390315242096638678918747415410892838252374210527033071794142580914860372179431496620330337164369560821042029843689259553153432462969300972589264527497704605483645782253341689199463070276297856954683840750231089729572219121213768-013024170415581204-013038442823616014-013072528810275874-013028983466301447-013139717579912435-013074723083930939-013124118448967514-013023791018397635-013027537654282573-013026189115487552-013075811776531843-013038659808634656-013110980398503136-013035330377184295-013071944068452823011470746038168711013001918115359845013002912167835179012958857557869230129957881450424750129166369988617640130985290708048650129763451673847510129850129652770101302937815883937801295189516001431501289468995981952701328518920731875901293262046376468801303032393524739013037289959908827011835437438594409
size
B difference from A
D3D12Multithreading
TitleThrottle=10000
(less is better)
Draws102512345678910111213141516003621621621621622405089820359281438405774647887323944904534967555875992803448965977318212401858823529411763979807177289769715E-284454007530930575E-235439137134052334E-2-32165146423427826E-2-7225034514496087E-2-010040160642570273-020410580339518369-020190023752969122-02398190045248868Draws102501234567891011121314151600896251419916698211657703874049177222484324324324323273267080745341583089275993467610233814523184601928337762237762237752806435267291613728982874935132328300213106020245052966209081309397529672564034856088297145122918318762953684210526315530376242945444769
NumContexts
cpu time difference from NumContexts1
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event-based sampling.
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
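Several of the counters above carry the suffix PTI, which later slides spell out as Per Thousand Instructions. As a minimal sketch of that normalization, assuming raw event counts and retired-instruction counts have already been exported from a profiler (the function names below are illustrative and are not part of AMD uProf):

```cpp
#include <cassert>
#include <cstdint>

// Normalize a raw event count to events per thousand retired instructions (PTI).
inline double perThousandInstructions(std::uint64_t eventCount,
                                      std::uint64_t retiredInstructions) {
    assert(retiredInstructions > 0);  // avoid dividing by zero
    return 1000.0 * static_cast<double>(eventCount) /
           static_cast<double>(retiredInstructions);
}

// Instructions per clock (IPC) from retired instructions and CPU clocks.
inline double instructionsPerClock(std::uint64_t retiredInstructions,
                                   std::uint64_t cpuClocks) {
    assert(cpuClocks > 0);
    return static_cast<double>(retiredInstructions) /
           static_cast<double>(cpuClocks);
}
```

For example, 23 WcbFull events over 1,000 retired instructions normalize to 23 PTI.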
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L2
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
Access labels on the diagram: rdpmc, SMN in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing.
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors.
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits.
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors.

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled for platform x64 shows higher performance.
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes

Chart: LAME 3.100 WAV to MP3 Encode (less is better)
Configuration, Platform | Elapsed Time (s) | Improvement over Release x86
Release, x86 | 52 |
ReleaseSSE2, x86 | 44 | +18%
Release, x64 | 43 | +21%
ReleaseSSE2, x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

Debugger output:
    0:000> g
    (79c838): Illegal instruction - code c000001d (first chance)
    (79c838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
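As a minimal sketch of testing CPUID before executing an instruction such as pabsd (an SSSE3 instruction), the check can be split into a pure bit test and a thin query wrapper. The helper names hasFeature, queryLeaf1Ecx, and canUsePabsd are hypothetical; the sketch assumes the MSVC `__cpuid` intrinsic or the GCC/Clang `__get_cpuid` helper is available and reports no features on other targets.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// CPUID leaf 1 reports SSSE3 support in ECX bit 9.
constexpr std::uint32_t kEcxSsse3 = 1u << 9;

// Pure helper: test one feature bit in a CPUID register value.
constexpr bool hasFeature(std::uint32_t reg, std::uint32_t bit) {
    return (reg & bit) != 0;
}

// Hypothetical wrapper: returns ECX of CPUID leaf 1, or 0 if CPUID
// cannot be queried on this compiler or architecture.
inline std::uint32_t queryLeaf1Ecx() {
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);  // regs = {EAX, EBX, ECX, EDX}
    return static_cast<std::uint32_t>(regs[2]);
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return ecx;
#else
    return 0;
#endif
}

// Take the SSSE3 code path only when the processor enumerates it.
inline bool canUsePabsd() {
    return hasFeature(queryLeaf1Ecx(), kEcxSsse3);
}
```

A caller would branch on canUsePabsd() once at startup and fall back to an SSE2 path otherwise, rather than assuming the instruction exists.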
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • Number of cores per processor & Number of threads per processor return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                      // logical processors
    DWORD_PTR p = 0xffffffffffffffff;    // process mask
    int b = 32;                          // bit index
    #if 0 // BAD
        int x = p;                       // signed narrowing
        int count = 0;
        // while (x != 0) {              // never zero
        while (x > 0) {                  // false
            count += (x & 1);
            x >>= 1;                     // fill bits with sign
        }
        printf("popcnt = %i\n", count);  // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
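The shift rules above can be illustrated with a small portable sketch that keeps the affinity mask in an unsigned 64-bit type throughout. The helper names countProcessors and maskForProcessor are hypothetical, and std::uint64_t stands in for the 64-bit DWORD_PTR used on Windows x64.

```cpp
#include <cassert>
#include <cstdint>

// Count the set bits in a 64-bit affinity mask. Unsigned right shifts fill
// the vacated positions with zeros, so the loop always terminates.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Build a mask with only bit `index` set. The shifted operand is widened to
// 64 bits first, so indices 32..63 are well defined.
inline std::uint64_t maskForProcessor(unsigned index) {
    assert(index < 64);
    return std::uint64_t{1} << index;
}
```

With a full process mask, countProcessors reports all 64 logical processors, and maskForProcessor(32) yields 0x0000000100000000 rather than an undefined value.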
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\n\",poi(r8)"

    // char buffer[0x1000]; // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or insufficient buffers.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores - 1, Draws / 250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: CPU time difference from NumContexts 1 (-50 to 400). Series: Draws 1025, Draws 10250.]
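Reading the recommendation on this slide as NumContexts = min(cores - 1, Draws / 250) (the division sign is an assumption, since the operator is not legible in this transcript), a minimal sketch of the heuristic could look like the following. The function name and the lower clamp to one context are illustrative additions and are not taken from the D3D12Multithreading sample.

```cpp
#include <algorithm>
#include <cassert>

// Sketch of NumContexts = min(cores - 1, draws / 250).
// The clamp to at least one context is an added assumption so that
// single-core systems and tiny scenes still get a usable value.
inline int recommendedNumContexts(int physicalCores, int draws) {
    assert(physicalCores >= 1 && draws >= 0);
    const int byCores = physicalCores - 1;  // leave one core for the main thread
    const int byDraws = draws / 250;        // roughly 250 draws per context
    return std::max(1, std::min(byCores, byDraws));
}
```

With 8 physical cores, 1025 draws gives 4 contexts and 10250 draws gives 7, which matches the two draw counts plotted for this slide.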
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • Avoid lock prefix instructions
  • Use the pause instruction
  • Test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions
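The practices listed above can be sketched portably with std::atomic instead of MSVC intrinsics. This is an illustrative variant under stated assumptions, not the code from the slides: the names SpinLock, cpuRelax, and runCounterDemo are hypothetical, a 64-byte cache line is assumed, and the fallback to std::this_thread::yield on non-x86 targets is an addition.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpuRelax() { _mm_pause(); }  // x86 pause instruction
#else
inline void cpuRelax() { std::this_thread::yield(); }  // assumed fallback
#endif

// Lock variable aligned to an assumed 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiters do not issue atomic writes.
            while (state.load(std::memory_order_relaxed) != 0) cpuRelax();
            // Test-and-set: attempt the exchange only once the lock looks free.
            if (state.exchange(1u, std::memory_order_acquire) == 0) return;
        }
    }

    void unlock() { state.store(0u, std::memory_order_release); }
};
static_assert(alignof(SpinLock) == 64, "lock should occupy its own cache line");

// Hypothetical demo: each thread increments a shared counter under the lock.
inline long runCounterDemo(int numThreads, int iterations) {
    assert(numThreads > 0 && iterations >= 0);
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling runCounterDemo(4, 10000) should return 40000 if mutual exclusion holds. Profile the real workload to judge the effect, since this sketch only checks correctness.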
    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Use Best Practices With Spinlocks (less is better)
binary | Elapsed Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
BEFORE (BAD)
    00000001400010C1:
        xor eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp eax,ecx
        je 00000001400010C1

AFTER (GOOD)
    00000001400010C0:
        cmp dword ptr [?gLock@@3IA],1
        je 00000001400010D9
        mov eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp eax,1
        jne 00000001400010DD
    00000001400010D9:
        pause
        jmp 00000001400010C0
    00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg

Preferred: instructions without lock prefix (the two xchg rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder.

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s) (0 to 35). Series: before, after.]
CODE SAMPLE
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, rand, srand, EXIT_SUCCESS
    #include <cstring>  // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);  // length is unknown at compile time
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Use < rather than <= so the loop does not write one byte past a[].
        for (int i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
    memcpy:
        jmp qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb 00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid Too Many Non-Temporal Streams (less is better)
binary | Elapsed Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0

AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad Thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
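The align-and-pad advice above can be sketched with standard C++ threads. This is an illustrative layout under stated assumptions: a 64-byte cache line, C++17 so that std::vector honors the over-aligned type, and hypothetical names (PaddedSum, sumPerThread) that do not appear in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Each thread's accumulator is aligned and padded to a full cache line,
// so neighboring slots never share a line.
struct alignas(kCacheLine) PaddedSum {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedSum) == kCacheLine, "one slot per cache line");
static_assert(alignof(PaddedSum) == kCacheLine, "slot starts on a line boundary");

// Hypothetical demo: every thread writes only to its own padded slot and
// counts the odd values in [0, iterations).
inline std::vector<unsigned long> sumPerThread(unsigned numThreads,
                                               unsigned long iterations) {
    assert(numThreads > 0);
    std::vector<PaddedSum> slots(numThreads);
    std::vector<std::thread> workers;
    workers.reserve(numThreads);
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&slots, t, iterations] {
            for (unsigned long i = 0; i < iterations; ++i) {
                slots[t].sum += i & 1u;
            }
        });
    }
    for (auto& w : workers) w.join();

    std::vector<unsigned long> result;
    for (const auto& s : slots) result.push_back(s.sum);
    return result;
}
```

Removing alignas from PaddedSum packs the slots into a shared cache line, which is the layout the slides warn against. Measure both variants on the target hardware before drawing conclusions.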
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing, elapsed time (s) per binary. Before: 36.5, After: 4.7 (+679%)]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // srand, rand, atoi, EXIT_SUCCESS
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0
// 4 bytes: neighboring array elements share a cache line (before)
struct ThreadData { unsigned long sum; };
#else
// 64 bytes: each array element occupies its own cache line (after)
struct alignas(64) ThreadData { unsigned long sum; };
#endif

// Each thread repeatedly updates only its own ThreadData element.
DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    // One thread per logical processor, each given its own array element.
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
            (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 =
        high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX.
[Profile screenshot; columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX.
[Profile screenshot; same columns as above]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi*8],rax
add rbp,40h
inc rdi
cmp rdi,r14
jb 0000000140001180
BEFORE (BAD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi*8],rax
add rbp,4
inc rdi
cmp rdi,r14
jb 0000000140001180
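The only meaningful difference between the two loops is the stride added to rbp, the pointer passed to each thread: 4 bytes in the bad version and 40h (64) bytes in the good one. A small sketch of what that stride means for cache-line occupancy follows. The type and function names are hypothetical, a 64-byte line is assumed, and std::uint32_t stands in for the 4-byte unsigned long of 64-bit Windows.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Stride of 4 bytes between elements, as in "add rbp,4".
struct PackedData { std::uint32_t sum; };

// Stride of 40h bytes between elements, as in "add rbp,40h".
struct alignas(kCacheLine) PaddedData { std::uint32_t sum; };

static_assert(sizeof(PackedData) == 4, "packed stride is 4 bytes");
static_assert(sizeof(PaddedData) == 0x40, "padded stride is 40h bytes");

// Number of consecutive array elements that fit in one cache line.
template <typename T>
constexpr std::size_t elements_per_line() {
    return sizeof(T) >= kCacheLine ? 1 : kCacheLine / sizeof(T);
}
```

With the packed layout, the counters of up to 16 threads sit in one line and every update invalidates that line for the other threads; with the padded layout each thread owns its line.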
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.
• New “Zen 2” core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
DISCLAIMER AND ATTRIBUTION
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
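The +679 label can be reproduced from the ten runs in the table above as (mean before / mean after - 1) x 100. The sketch below does that arithmetic; the helper names are hypothetical, and the values are copied exactly as they appear in the table. The extraction appears to have dropped the decimal points, but the ratio is unaffected as long as both columns lost the same number of decimal places.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Run times copied from the "before" and "after" columns of the table
// (decimal points missing in the extracted text; only the ratio is used).
const std::vector<double> kBefore = {3878270, 3579369, 3617797, 3585746, 4053311,
                                     3620973, 3512279, 3615476, 3507937, 3562341};
const std::vector<double> kAfter = {468667, 468613, 468606, 468740, 468798,
                                    468666, 468835, 468786, 468740, 468730};

inline double mean(const std::vector<double>& v) {
    assert(!v.empty());
    return std::accumulate(v.begin(), v.end(), 0.0) / static_cast<double>(v.size());
}

// Percentage by which the "after" binary is faster than the "before" binary.
inline double percent_faster(const std::vector<double>& before,
                             const std::vector<double>& after) {
    return (mean(before) / mean(after) - 1.0) * 100.0;
}
```

Evaluating percent_faster(kBefore, kAfter) gives roughly 679.4, which matches the label after rounding down to a whole percent.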
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.85 |
DX11 | 58.95 | +9%
VK Beta | 60.17 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
• ~64ns near memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM LOCAL DRAM
[Diagram; labels: DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip)]
• The cost of a refill from the other CCX may be similar to memory latency.
REFILL FROM OTHER LOCAL CCX
[Diagram; labels: DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip)]
REFILL FROM REMOTE DIE DRAM
[Diagram of two dies; labels on each: DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer, IFIS or IFOP (off-chip)]
• ~105ns far memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die’s memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
AMD UPROF PROFILER
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event-based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors).
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Embedded chart data for the charts listed below; raw series omitted]
• World of Warcraft v8.1 DX12 (higher is better), Average FPS: gxMT Off vs gxMT On (+36%)
• Audacity LAME MP3 Encode x86 (lower is better), Elapsed Time (s): v3.99 non-SSE2 vs v3.100 SSE2 (+107%)
• DOTA 2 version 3359 gameplay 7.21B (higher is better), Average FPS: DX9, DX11 (+9%), VK Beta (+12%)
• Cache Latency (less is better), core clock cycles: AMD Ryzen 7 1800X L1D 4, L2U 17, L3U 40; AMD Ryzen 7 2700X L1D 4, L2U 12, L3U 35
• Use Best Practices With Spinlocks (less is better), Elapsed Time (s) per binary: Before vs After (+1018%)
• Avoid Too Many Non-Temporal Streams (less is better), Elapsed Time (s) per binary: Before vs After (+123%)
• Avoid False Sharing (less is better), Elapsed Time (s) per binary: Before vs After (+679%)
• LAME 3.100 WAV to MP3 Encode (less is better), Elapsed Time (s) by Configuration/Platform: Release x86, ReleaseSSE2 x86 (+18%), Release x64 (+21%), ReleaseSSE2 x64 (+33%)
• call memcpy benchmark (less is better), Elapsed Time (s) by size (1 to 160): before vs after
• call memcpy benchmark, B difference from A by size (1 to 160)
• D3D12Multithreading TitleThrottle=10000 (less is better), CPU time difference from NumContexts1 by NumContexts (1 to 16): Draws1025 and Draws10250
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling.
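Several of the columns above are derived metrics. The sketch below shows how such values could be computed from raw counts, assuming that PTI stands for "per thousand (retired) instructions" and that IPC is retired instructions divided by CPU clocks. The function names are hypothetical and are not part of any AMD uProf API.

```cpp
// Events per thousand retired instructions (assumed meaning of "PTI").
constexpr double per_thousand_instructions(double event_count,
                                           double retired_instructions) {
    return retired_instructions > 0.0
               ? 1000.0 * event_count / retired_instructions
               : 0.0;
}

// Instructions per cycle: retired instructions divided by CPU clocks.
constexpr double instructions_per_cycle(double retired_instructions,
                                        double cpu_clocks) {
    return cpu_clocks > 0.0 ? retired_instructions / cpu_clocks : 0.0;
}
```

Normalizing by retired instructions makes before and after profiles comparable even when the two runs execute for different lengths of time.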
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201928
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• rdpmc
• SMN in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: the SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
Chart: LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) by Configuration and Platform (less is better)
Configuration, Platform | Elapsed Time (s) | Improvement vs Release x86
Release, x86 | 52 |
ReleaseSSE2, x86 | 44 | +18%
Release, x64 | 43 | +21%
ReleaseSSE2, x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
  • The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
  • Windows 7 x86 does NOT require SSE2 & PrefetchW
Example debugger output:
    0:000> g
    (79c.838): Illegal instruction - code c000001d (first chance)
    (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
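The CPUID-before-use rule can be sketched as below. This is a minimal illustration, not code from the presentation: the `CpuidRegs` struct and the helpers `hasSsse3`, `hasAvx512f`, `hasFma4`, and `chooseAbsPath` are hypothetical names. The bit positions are the documented ones: SSSE3 is leaf 1 ECX bit 9, AVX512F is leaf 7 (subleaf 0) EBX bit 16, and FMA4 is extended leaf 0x80000001 ECX bit 16. The helpers decode raw register values, so fetching them (for example with `__cpuid` / `__cpuidex` on MSVC) is left to the caller. A complete AVX or AVX-512 check would also confirm operating system support for the register state, which this sketch omits.

```cpp
#include <cstdint>

// Hypothetical container for the four registers returned by one CPUID query.
struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

// SSSE3 (for example pabsd): CPUID leaf 1, ECX bit 9.
inline bool hasSsse3(const CpuidRegs& leaf1) {
    return (leaf1.ecx & (1u << 9)) != 0;
}

// AVX512F: CPUID leaf 7, subleaf 0, EBX bit 16.
inline bool hasAvx512f(const CpuidRegs& leaf7) {
    return (leaf7.ebx & (1u << 16)) != 0;
}

// FMA4: CPUID extended leaf 0x80000001, ECX bit 16.
inline bool hasFma4(const CpuidRegs& extLeaf1) {
    return (extLeaf1.ecx & (1u << 16)) != 0;
}

// Pick a code path once at startup instead of executing an
// instruction the processor may not support.
enum class AbsPath { Scalar, Ssse3 };

inline AbsPath chooseAbsPath(const CpuidRegs& leaf1) {
    return hasSsse3(leaf1) ? AbsPath::Ssse3 : AbsPath::Scalar;
}
```

Selecting the path once and storing it (for example in a function pointer) keeps the check out of hot loops.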
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits, and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors.
    // getProcessorCount, getCpuidVendor, and getCpuidFamily are helpers
    // defined elsewhere in the sample.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
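The thread-count policy above depends on the CPUID family value. As a rough sketch under stated assumptions, the decision can be written as two pure functions: one that decodes the display family from CPUID leaf 1 EAX (base family in bits 11:8, extended family in bits 27:20, added together when the base family is 0xF), and one that applies the policy. The names `cpuidDisplayFamily` and `defaultThreadCount` are hypothetical, and the caller is assumed to supply the vendor string and the core counts.

```cpp
#include <cstdint>
#include <cstring>

// Decode the display family from CPUID leaf 1 EAX.
// When the base family field is 0xF, the extended family field is added to it.
inline unsigned cpuidDisplayFamily(std::uint32_t eax) {
    const unsigned baseFamily = (eax >> 8) & 0xFu;
    const unsigned extFamily = (eax >> 20) & 0xFFu;
    return (baseFamily == 0xFu) ? baseFamily + extFamily : baseFamily;
}

// Policy from the slide: on AMD processors other than family 15h
// ("Bulldozer"), default to the physical core count; otherwise use
// the logical processor count.
inline unsigned defaultThreadCount(const char* vendor, unsigned family,
                                   unsigned cores, unsigned logical) {
    if (std::strcmp(vendor, "AuthenticAMD") == 0 && family != 0x15u) {
        return cores;
    }
    return logical;
}
```

For example, an EAX value whose family fields decode to 17h with 8 cores and 16 logical processors yields a default of 8 threads. Profiling should still decide the final count.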
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • "Number of cores per processor" & "Number of threads per processor core" return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing of affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                      // logical processors
    DWORD_PTR p = 0xffffffffffffffff;    // process mask
    int b = 32;                          // bit index
    #if 0 // BAD
        int x = p;                       // signed narrowing
        int count = 0;
        // while (x != 0) {              // never zero
        while (x > 0) {                  // false
            count += (x & 1);
            x >>= 1;                     // fill bits with sign
        }
        printf("popcnt = %i\n", count);  // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
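A portable restatement of the GOOD branch, as a sketch only: it swaps the Windows `DWORD_PTR` type for `std::uint64_t` so the logic can be compiled anywhere, and the function names `countLogicalProcessors` and `processorBit` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Count set bits in a 64-bit affinity mask.
// The mask stays unsigned, so the right shift fills with zeros
// and the loop terminates after at most 64 iterations.
inline int countLogicalProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Build the mask bit for one logical processor index.
// The shifted constant is 64 bits wide, so index 32 and above is well defined.
inline std::uint64_t processorBit(unsigned index) {
    assert(index < 64);
    return std::uint64_t{1} << index;
}
```

With a full 64-processor mask the count is 64, and `processorBit(32)` is 0x0000000100000000, matching the value noted in the slide's GOOD branch.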
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\n\", poi(r8)"

    // char buffer[0x1000];  // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // The first call fails and reports the required length in len.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). Y axis: cpu time % difference from NumContexts 1 (-50 to 400). X axis: NumContexts (0 to 16). Series: Draws1025, Draws10250.
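The NumContexts recommendation, read as min(cores - 1, Draws / 250), can be sketched as a small helper. The function name `recommendedNumContexts` is hypothetical, the division is assumed to be integer division, and the clamp to a minimum of one context is an added assumption so that small draw counts or single-core systems still get a valid value.

```cpp
#include <algorithm>

// Sketch of the slide's heuristic: NumContexts = min(cores - 1, Draws / 250).
// Assumption (not from the slide): never return fewer than 1 context.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = (physicalCores > 1) ? physicalCores - 1 : 1;
    const unsigned byDraws = draws / 250;  // integer division
    return std::max(1u, std::min(byCores, byDraws));
}
```

On an 8-core processor this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws. Treat the result as a starting point and confirm it by profiling.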
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K per thousand instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            // Every spin iteration issues a locked instruction (lock cmpxchg).
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test first with a plain read, then test-and-set with xchg;
            // pause between attempts.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
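The same test and test-and-set pattern can be sketched with standard C++ atomics instead of the MSVC intrinsics. This is an illustrative variant, not the presenter's code: the `SpinLock` class and `SPIN_PAUSE` macro are hypothetical names, `unlock` uses a release store rather than an exchange, and on non-x86 targets the pause is replaced by `std::this_thread::yield()` purely so the sketch compiles everywhere.

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// Test and test-and-set spin lock, aligned so the lock word
// occupies its own 64-byte cache line.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is taken.
            while (flag_.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: attempt the exchange only when the lock looked free.
            if (!flag_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    bool try_lock() {
        return !flag_.load(std::memory_order_relaxed) &&
               !flag_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        flag_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> flag_{false};
};
```

Because the class provides `lock`, `try_lock`, and `unlock`, it can be used with `std::lock_guard<SpinLock>`.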
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Use Best Practices With Spinlocks, Elapsed Time (s) by binary (less is better)
binary | Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE
    #include <intrin.h>
    #include <stdio.h>
    #include <stdlib.h>    // strtof, EXIT_SUCCESS
    #include <windows.h>
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    // MyLock and gLock are defined as shown on the SUMMARY slide.
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (int i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:
BEFORE (BAD)
    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix (_InterlockedExchange, _InterlockedExchangePointer)
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16
memset.asm line 93
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: call memcpy benchmark (less is better). Y axis: Elapsed Time (s), 0 to 35. X axis: size, 0 to 160. Series: before, after.
CODE SAMPLE
    #include <stdio.h>
    #include <stdlib.h>   // atoi, atoll, rand, srand
    #include <string.h>   // memcpy, memset
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);   // size is unknown at compile time
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Loop bound is < sizeof(a); the slide's <= writes one byte past the array.
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
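The PTI ("per thousand instructions") figures used in these profiling thresholds are event counts normalized by retired instructions. A minimal sketch of that arithmetic follows. The names `perThousandInstructions`, `Severity`, and `classify` are hypothetical and not part of AMD uProf, and the thresholds are passed in by the caller so that the values printed on the slide for a given event can be supplied as-is.

```cpp
#include <cstdint>

// Normalize a raw event count to events per thousand retired instructions.
inline double perThousandInstructions(std::uint64_t events,
                                      std::uint64_t retiredInstructions) {
    if (retiredInstructions == 0) {
        return 0.0;  // avoid dividing by zero on an empty sample
    }
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retiredInstructions);
}

enum class Severity { Ok, Bad, VeryBad };

// Compare a PTI value against caller-supplied "bad" and "very bad" thresholds.
inline Severity classify(double pti, double badAt, double veryBadAt) {
    if (pti >= veryBadAt) return Severity::VeryBad;
    if (pti >= badAt) return Severity::Bad;
    return Severity::Ok;
}
```

Applying this per function to the top entries in a profile gives a quick list of candidates for closer inspection.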
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Avoid Too Many Non-Temporal Streams, Elapsed Time (s) by binary (less is better)
binary | Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <stdio.h>
    #include <stdlib.h>   // atoi, atof, atoll, rand, srand
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
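The "align and pad" advice can be sketched as follows. This is an illustration under stated assumptions rather than the presenter's code: `PackedCounter`, `PaddedCounter`, `kCacheLineSize`, and `sameCacheLine` are hypothetical names, and a 64-byte cache line is assumed.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache line size in bytes.
constexpr std::size_t kCacheLineSize = 64;

// Unpadded per-thread data: adjacent array elements share a cache line.
struct PackedCounter {
    unsigned long sum;
};

// Padded per-thread data: alignas rounds sizeof up to 64, so each
// array element occupies its own cache line.
struct alignas(kCacheLineSize) PaddedCounter {
    unsigned long sum;
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize,
              "each padded element should fill exactly one cache line");

// True when two addresses fall within the same 64-byte line.
inline bool sameCacheLine(const void* p, const void* q) {
    const auto a = reinterpret_cast<std::uintptr_t>(p) / kCacheLineSize;
    const auto b = reinterpret_cast<std::uintptr_t>(q) / kCacheLineSize;
    return a == b;
}
```

Giving each thread its own `PaddedCounter` element means one thread's writes do not invalidate the line another thread is updating.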
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Avoid False Sharing, Time (s) by binary
binary | Time (s)
Before | 365
After | 47 (+679%)
CODE SAMPLE
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>   // atoi, rand, srand
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (int i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < (size_t)numThreads; ++i)
            printf("sum[%llu] = %lu\n", (unsigned long long)i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
Columns shown: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX
Columns shown: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
BEFORE (BAD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New "Zen 2" core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics 10, liquid good, Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
• The cost of a refill from the other CCX may be similar to memory latency
• CCX: Core Complex
• CCM: Cache-Coherent Master
• SDF: Scalable Data Fabric
• CAKE: Coherent AMD socKet Extender
• IFIS: Infinity Fabric Inter-Socket SerDes
• IFOP: Infinity Fabric On-Package SerDes
• CS: Coherent Slave
• UMC: Unified Memory Controller
REFILL FROM OTHER LOCAL CCX
[Diagram: one die with two CCXs, two CCMs, two CSs, two UMCs and two DDR4 channels joined by the SDF Transport Layer, plus a CAKE link to IFIS or IFOP off-chip.]
REFILL FROM REMOTE DIE DRAM
[Diagram: two dies, each with two CCXs, two CCMs, two CSs, two UMCs, two DDR4 channels, a CAKE and the SDF Transport Layer, linked off-chip over IFIS or IFOP.]
• ~105ns far memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s): approximately 64ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool: approximately 105ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 398.36), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration.
AMD μPROF PROFILER
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p High, no vsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
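The derived metrics in the list above can be sketched in a few lines. This is a minimal illustration, assuming that PTI means "per thousand retired instructions" and that IPC is retired instructions divided by CPU clocks. The struct and function names are hypothetical and are not AMD μProf identifiers.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for raw counts from one profiling run.
// Field names are illustrative only, not profiler identifiers.
struct CounterSample {
    std::uint64_t cpuClocks;           // core clock cycles
    std::uint64_t retiredInstructions; // retired instructions
    std::uint64_t eventCount;          // any other event, e.g. data cache accesses
};

// Instructions per cycle: retired instructions divided by core clocks.
double instructionsPerCycle(const CounterSample& s) {
    assert(s.cpuClocks != 0);
    return static_cast<double>(s.retiredInstructions) /
           static_cast<double>(s.cpuClocks);
}

// Event rate per thousand retired instructions (PTI, as assumed above).
double perThousandInstructions(const CounterSample& s) {
    assert(s.retiredInstructions != 0);
    return 1000.0 * static_cast<double>(s.eventCount) /
           static_cast<double>(s.retiredInstructions);
}
```

Normalizing by retired instructions rather than by elapsed time makes two runs of different length comparable, which is presumably why most events in the profile are reported as PTI.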
PERFORMANCE COUNTER DOMAINS
FP: floating point
LS: load/store
IC/BP: instruction cache and branch prediction
EX (SC): integer ALU & AGU execution and scheduling
L2
DE: instruction decode, dispatch, microcode sequencer & micro-op cache
L3
DF: Data Fabric
UMC: Unified Memory Controller (NDA only)
IOHC: IO Hub Controller (NDA only)
Access methods shown: rdpmc, SMN, in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops. Support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
• Binary compiled after applying Platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORM X64
LAME 3.100 WAV to MP3 Encode (less is better)
Configuration Platform | Elapsed Time (s) | diff
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
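The difference labels in the LAME chart are consistent with dividing the baseline time by the new time. The sketch below assumes that formula, (before / after - 1) × 100 rounded to the nearest integer. The function name is hypothetical, and the formula is inferred from the numbers rather than stated on the slide.

```cpp
#include <cassert>
#include <cmath>

// Percentage by which the `after` run is faster than the `before` run,
// computed from elapsed times: (before / after - 1) * 100, rounded.
// Assumption: this is how the "+18", "+21", "+33" labels were derived.
long percentFaster(double beforeSeconds, double afterSeconds) {
    assert(beforeSeconds > 0.0 && afterSeconds > 0.0);
    return std::lround((beforeSeconds / afterSeconds - 1.0) * 100.0);
}
```

With the Release x86 time of 52 seconds as the baseline, the three other configurations (44, 43 and 39 seconds) give 18, 21 and 33, matching the chart.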
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
• ISPC: https://github.com/ispc/ispc
  • x86-x64 issue fixed November 21, 2018
  • CPU_Pentium4 issue fixed July 12, 2018
• Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
  • pabsd issue fixed November 2nd, 2018
• Bullet3: https://github.com/bulletphysics/bullet3
  • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW
Debugger output:
  0:000> g
  (79c.838): Illegal instruction - code c000001d (first chance)
  (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
  mod!foobar+0xb63:
  00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
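The checks named on this slide can be sketched as tests on CPUID register values. The bit positions are the architecturally documented ones: SSSE3 is Fn0000_0001 ECX[9], AVX512F is Fn0000_0007 (ECX=0) EBX[16], and FMA4 is Fn8000_0001 ECX[16]. The function names, and the idea of passing register values in as arguments, are assumptions made so the logic can be exercised without a specific compiler intrinsic. In a real program the values would come from something like `__cpuidex` (MSVC) or `__get_cpuid_count` (GCC/Clang), and AVX-family paths would also need an OS-support check via `_xgetbv`.

```cpp
#include <cassert>
#include <cstdint>

// CPUID Fn0000_0001 ECX bit 9: SSSE3 (pabsd and friends).
constexpr bool hasSsse3(std::uint32_t leaf1Ecx) {
    return ((leaf1Ecx >> 9) & 1u) != 0;
}

// CPUID Fn0000_0007 (ECX=0) EBX bit 16: AVX512F.
constexpr bool hasAvx512f(std::uint32_t leaf7Ebx) {
    return ((leaf7Ebx >> 16) & 1u) != 0;
}

// CPUID Fn8000_0001 ECX bit 16: FMA4.
constexpr bool hasFma4(std::uint32_t ext1Ecx) {
    return ((ext1Ecx >> 16) & 1u) != 0;
}

// Hypothetical dispatch choice for an absolute-value kernel.
enum class AbsPath { Scalar, Ssse3 };

constexpr AbsPath selectAbsPath(std::uint32_t leaf1Ecx) {
    return hasSsse3(leaf1Ecx) ? AbsPath::Ssse3 : AbsPath::Scalar;
}

// Debug-build guard to place in front of an SSSE3-only code path.
inline void assertSsse3(std::uint32_t leaf1Ecx) {
    assert(hasSsse3(leaf1Ecx));
}
```

Selecting the code path once at startup and storing the result keeps the CPUID query out of hot loops, and a processor without the feature falls back to the scalar path instead of faulting.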
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        }
        else {
            count = cores;
        }
    }
    return count;
}
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
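The selection rule on this slide can be restated as a pure function so it can be tried with known values. This is only a sketch: `CpuInfo` and `defaultThreadCount` are hypothetical names, and a real program would fill the struct from CPUID and the operating system, as the slide's helper functions do.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical inputs; in a real program these come from CPUID and the OS.
struct CpuInfo {
    const char* vendor;         // CPUID vendor string, e.g. "AuthenticAMD"
    unsigned family;            // CPUID family, e.g. 0x17 for "Zen"
    unsigned physicalCores;     // physical core count
    unsigned logicalProcessors; // logical processor count
};

// Default to all logical processors. On AMD processors other than
// family 15h ("Bulldozer"), use the physical core count instead.
unsigned defaultThreadCount(const CpuInfo& cpu) {
    assert(cpu.vendor != nullptr);
    unsigned count = cpu.logicalProcessors;
    if (std::strcmp(cpu.vendor, "AuthenticAMD") == 0 && cpu.family != 0x15) {
        count = cpu.physicalCores;
    }
    return count;
}
```

For a Ryzen 7 2700X (family 17h, 8 cores, 16 logical processors) this returns 8. The result is a starting point only; profiling decides the final thread count.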
Incorrect enumexe output for Ryzentrade 7 2700X
Physical Processor ID 0 has 16 coresas logical processors 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors 16Number of active physical processors 1Number of cores per processor 16Number of threads per processor core 1
--------------------------------------------------------------------------------Correct output
Number of cores per processor 8Number of threads per processor core 2
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
• Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64;                    // logical processors
DWORD_PTR p = 0xffffffffffffffff;  // process mask
int b = 32;                        // bit index
#if 0 // BAD
int x = p;                         // signed narrowing
int count = 0;
// while (x != 0) {                // never zero
while (x > 0) {                    // false
    count += (x & 1);
    x >>= 1;                       // fill bits with sign
}
printf("popcnt = %i\n", count);    // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
• Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
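The shift rules above can be sketched portably as follows. This is an illustration under stated assumptions, not code from the presentation: `std::uint64_t` stands in for `DWORD_PTR` so the sketch builds outside Windows, and the helper names `count_set_bits` and `bit_mask` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Counts the logical processors present in a 64-bit affinity mask.
// The mask stays unsigned and full width, so >>= shifts in zeros
// and the loop always terminates.
inline int count_set_bits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Builds the mask for one logical processor index (0..63).
// The shifted operand is a 64-bit unsigned value, so shifting by 32
// or more is well defined, unlike (1 << index) on a 32-bit int.
inline std::uint64_t bit_mask(unsigned index) {
    assert(index < 64);
    return std::uint64_t{1} << index;
}
```

With a full process mask of `0xffffffffffffffff`, `count_set_bits` reports 64, and `bit_mask(32)` yields `0x0000000100000000`, matching the GOOD branch on the slide.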
// char buffer[0x1000]; // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
            // ...
        }
        free(buffer);
    }
}
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
• Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\n\",poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
• Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation
• See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
[Chart: D3D12Multithreading, TitleThrottle=10000. X axis: NumContexts (0 to 16). Y axis: cpu time difference from NumContexts 1 (-50 to 400, higher is better). Series: Draws1025, Draws10250.]
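The NumContexts recommendation from this slide can be written as a small helper. This is a sketch only: the function name `recommended_num_contexts` is hypothetical, integer division for Draws/250 is an assumption, and the lower clamp to one context is an addition of this sketch so that the result is always usable, not part of the slide's formula.

```cpp
#include <algorithm>
#include <cassert>

// NumContexts = min(cores - 1, draws / 250), as recommended on the slide.
// Assumption of this sketch: integer division, and at least one context.
inline int recommended_num_contexts(int physical_cores, int draws) {
    assert(physical_cores >= 1 && draws >= 0);
    const int by_cores = physical_cores - 1;  // leave one core for the main thread
    const int by_draws = draws / 250;         // enough draws to keep each context busy
    return std::max(1, std::min(by_cores, by_draws));
}
```

For the 8-core test system, 1025 draws gives 4 contexts and 10250 draws gives 7. The formula is a starting point, and profiling the actual workload remains the deciding step.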
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
            pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
        }
    }
#else // GOOD
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
            (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
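For toolchains without the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative equivalent, not the presenter's code. The class name `SpinLock` and the macro `SPIN_PAUSE` are hypothetical, the release store in `unlock` is a choice of this sketch (the slide uses `_InterlockedExchange`), and the non-x86 fallback to `std::this_thread::yield()` is an assumption.

```cpp
#include <atomic>
#include <cassert>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#include <thread>
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// One lock per 64-byte cache line, as the slide recommends.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiters do not issue locked writes.
            while (state_.load(std::memory_order_relaxed) == kTaken) {
                SPIN_PAUSE();
            }
            // Test-and-set: a single atomic exchange once the lock looks free.
            if (state_.exchange(kTaken, std::memory_order_acquire) == kFree) {
                return;
            }
        }
    }

    // Non-blocking attempt; returns true when the lock was acquired.
    bool try_lock() {
        return state_.load(std::memory_order_relaxed) == kFree &&
               state_.exchange(kTaken, std::memory_order_acquire) == kFree;
    }

    void unlock() { state_.store(kFree, std::memory_order_release); }

private:
    enum : unsigned { kFree = 0, kTaken = 1 };
    std::atomic<unsigned> state_{kFree};
};

static_assert(alignof(SpinLock) == 64, "lock should own a full cache line");
```

Because the class provides `lock` and `unlock`, it can be used with `std::lock_guard<SpinLock>` in place of explicit calls.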
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Use Best Practices With Spinlocks. Elapsed Time (s) by binary, less is better. Before: 304, After: 27 (+1018%).]
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm>
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512
alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp  dword ptr [?gLock@@3IA],1
    je   00000001400010D9
    mov  eax,1
    xchg eax,dword ptr [?gLock@@3IA]
    cmp  eax,1
    jne  00000001400010DD
00000001400010D9:
    pause
    jmp  00000001400010C0
00000001400010DD:
BEFORE (BAD)
00000001400010C1:
    xor  eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp  eax,ecx
    je   00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder

; memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG    ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]          ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

; memset.asm line 93
; Check if strings should be used
    cmp     r8, 128                    ; is this a small set, size <= 128?
    ja      XmmSet                     ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG    ; check if string set should be used
    jnc     XmmSetSmall                ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: call memcpy benchmark. Elapsed Time (s) vs. size (0 to 160), less is better. Series: before, after.]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // Use < rather than <= so the last write stays inside a[].
    for (int i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
    mov r11,rcx
    mov r10,rdx
    cmp r8,10h
    jbe 00000001400012F0
    cmp r8,20h
    jbe 00000001400012D0
    sub rdx,rcx
    jae 00000001400012A4
    lea rax,[r8+r10]
    cmp rcx,rax
    jb  00000001400015D0
    cmp r8,80h
    jbe 0000000140001510
    …
BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll memcpy
memcpy:
    jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
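A minimal sketch of the "one stream per hardware thread" advice, separate from the presenter's benchmark: a loop that produces two output arrays keeps a single non-temporal stream and writes the second array with regular stores. The function name `scale_and_offset`, its parameters, and the macro `SKETCH_HAVE_SSE` are hypothetical; inputs are assumed to be 16-byte aligned with a length that is a multiple of 4, and non-SSE targets fall back to scalar code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

// streamed_out[i] = in[i] * k   (non-temporal store, the only WC stream)
// cached_out[i]   = in[i] + k   (regular store through the caches)
inline void scale_and_offset(const float* in, float* streamed_out,
                             float* cached_out, std::size_t n, float k) {
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(in) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(streamed_out) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(cached_out) % 16 == 0);
#if SKETCH_HAVE_SSE
    const __m128 vk = _mm_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 v = _mm_load_ps(in + i);
        _mm_stream_ps(streamed_out + i, _mm_mul_ps(v, vk)); // single NT stream
        _mm_store_ps(cached_out + i, _mm_add_ps(v, vk));    // not a second NT stream
    }
    _mm_sfence(); // order the non-temporal stores before later stores
#else
    for (std::size_t i = 0; i < n; ++i) {
        streamed_out[i] = in[i] * k;
        cached_out[i] = in[i] + k;
    }
#endif
}
```

Which output deserves the non-temporal store depends on reuse: the array that will not be read again soon is the natural candidate, and the WcbFull counter is the way to confirm the choice.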
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid Too Many Non-Temporal Streams. Elapsed Time (s) by binary, less is better. Before: 104, After: 47 (+123%).]
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency
• L3 is filled from L2 victims of all 4 cores within a CCX
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX
• And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
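The "align and pad" advice can be checked at compile time. The sketch below is illustrative only: the names `PackedCounter`, `PaddedCounter`, `kCacheLine`, and `same_cache_line` are hypothetical, and the 64-byte line size is taken from the `alignas(64)` used throughout these slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64; // assumed line size, per the slides' alignas(64)

// Adjacent array elements of this type share a cache line (false sharing risk).
struct PackedCounter { unsigned long sum; };

// alignas pads each element to a full line, so each thread owns its own line.
struct alignas(kCacheLine) PaddedCounter { unsigned long sum; };

static_assert(sizeof(PaddedCounter) == kCacheLine, "one element per cache line");
static_assert(alignof(PaddedCounter) == kCacheLine, "elements start on a line boundary");

// True when both addresses fall within the same 64-byte line.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

An array of `PaddedCounter` costs 64 bytes per thread instead of a few, which is usually a small price next to the coherency traffic it avoids.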
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid False Sharing. Time (s) by binary. Before: 365, After: 47 (+679%).]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;
#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif
DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
            (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms) %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov  qword ptr [rsp+28h],rsi
    lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9,rbp
    mov  dword ptr [rsp+20h],esi
    xor  edx,edx
    xor  ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r15+rdi*8],rax
    add  rbp,40h
    inc  rdi
    cmp  rdi,r14
    jb   0000000140001180
BEFORE (BAD)
0000000140001180:
    mov  qword ptr [rsp+28h],rsi
    lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9,rbp
    mov  dword ptr [rsp+20h],esi
    xor  edx,edx
    xor  ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r12+rdi*8],rax
    add  rbp,4
    inc  rdi
    cmp  rdi,r14
    jb   0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
DISCLAIMER AND ATTRIBUTION
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250
1 | 0.2016 | 1.5024 | 1 | 0 | 0
2 | 0.148 | 0.7923 | 2 | 36 | 90
3 | 0.1336 | 0.5653 | 3 | 51 | 166
4 | 0.1278 | 0.4625 | 4 | 58 | 225
5 | 0.1387 | 0.4025 | 5 | 45 | 273
6 | 0.1499 | 0.3674 | 6 | 34 | 309
7 | 0.17 | 0.3429 | 7 | 19 | 338
8 | 0.1867 | 0.3432 | 8 | 8 | 338
9 | 0.1859 | 0.3947 | 9 | 8 | 281
10 | 0.1947 | 0.3854 | 10 | 4 | 290
11 | 0.2083 | 0.3754 | 11 | -3 | 300
12 | 0.2173 | 0.3788 | 12 | -7 | 297
13 | 0.2241 | 0.3787 | 13 | -10 | 297
14 | 0.2533 | 0.3783 | 14 | -20 | 297
15 | 0.2526 | 0.3800 | 15 | -20 | 295
16 | 0.2652 | 0.3721 | 16 | -24 | 304
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floydd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference
3.878270 | 0.468667 | Before | 36.5 |
3.579369 | 0.468613 | After | 4.7 | +679%
3.617797 | 0.468606
3.585746 | 0.468740
4.053311 | 0.468798
3.620973 | 0.468666
3.512279 | 0.468835
3.615476 | 0.468786
3.507937 | 0.468740
3.562341 | 0.468730
before | after | binary | seconds | difference
10.07610284 | 4.76582356 | Before | 104 |
10.24247928 | 4.48045378 | After | 47 | +123%
10.52145299 | 4.66659369
10.56735958 | 4.64907149
10.24368829 | 4.74053478
10.66633156 | 4.61674963
10.58203029 | 4.62514829
10.41118097 | 4.9289895
10.55392649 | 4.75898095
10.51172582 | 4.61760799
before | after | binary | seconds | difference
36.08651906 | 2.67431402 | Before | 304 |
22.42053937 | 2.73707055 | After | 27 | +1018%
30.42886884 | 2.71746015
26.09372553 | 2.69953529
32.37432947 | 2.69730537
29.91803304 | 2.725437
36.65558054 | 2.73832348
30.77696925 | 2.72580411
23.74288763 | 2.71854875
35.04976333 | 2.72329958
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 1911 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
REFILL FROM REMOTE DIE DRAM
[Diagram labels: DDR4, CCX, CCM, CS, UMC, CAKE, SDF Transport Layer; IFIS or IFOP (off-chip)]
• ~105 ns far memory
• RP2-22: DRAM latency for Die0 or Die2 communicating with their respective local memory pool(s) is approximately 64 ns with DDR4-3200. DRAM latency for Die0 or Die2 communicating with the other die's memory pool is approximately 105 ns with DDR4-3200. Die-to-die bandwidth of the Infinity Fabric with DDR4-3200 was measured at approximately 25GBps. AMD system configuration: AMD Ryzen™ Threadripper™ 2990WX and 1950X, Corsair H100i CLC, 4x8GB DDR4-3200 (16-18-18), Asus Zenith X399 Extreme (BIOS 0008), GeForce GTX 1080 Ti (driver 39836), Windows® 10 x64 1803, Samsung 850 Pro SSD, Western Digital Black 2TB HDD. Results may vary with configuration. RP2-22
AMD UPROF PROFILER
• Remote profiling
• Thread Concurrency
• Multiple counters using the same event but different unit masks are supported
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
V2.0 NEW FEATURES
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell, January 18, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Embedded chart data, not reproduced. Charts: World of Warcraft v8.1 DX12 (gxMT Off vs. gxMT On, average FPS); Audacity LAME MP3 Encode x86 (elapsed time, s); DOTA 2 (DX9, DX11, VK Beta, average FPS); Cache Latency (core clock cycles, less is better); Use Best Practices With Spinlocks (time, s); Avoid Too Many Non-Temporal Streams (time, s); Avoid False Sharing (time, s); LAME 3.100 WAV to MP3 Encode (seconds); call memcpy benchmark (elapsed time vs. size, and B difference from A vs. size); D3D12Multithreading TitleThrottle=10000 (CPU time difference from NumContexts 1 vs. NumContexts, for Draws 1025 and Draws 10250).]
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• rdpmc, SMN in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: SMU may still reduce frequency if application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
• Binary compiled for the x64 platform shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORM X64
LAME 3.100 WAV to MP3 Encode (less is better)
Configuration Platform | Elapsed Time (s) | diff
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
• ISPC https://github.com/ispc/ispc
  • x86-x64 issue fixed November 21, 2018
  • CPU_Pentium4 issue fixed July 12, 2018
• Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
  • pabsd issue fixed November 2nd, 2018
• Bullet3 https://github.com/bulletphysics/bullet3
  • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW
WinDbg output for an illegal instruction:
0:000> g
(79c838): Illegal instruction - code c000001d (first chance)
(79c838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
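The guidance on this slide can be sketched as a small feature check. This is a minimal sketch, not code from the talk: the names `CpuidRegs`, `queryCpuid`, `hasSsse3`, `hasFma4`, and `hasAvx512f` are hypothetical helpers introduced for illustration. The bit positions are the architectural ones (SSSE3: leaf 1 ECX bit 9; FMA4: leaf 0x80000001 ECX bit 16; AVX512F: leaf 7 subleaf 0 EBX bit 16). The sketch does not cover the OS-support check via `_xgetbv` that AVX-class features also need.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#endif

// Raw CPUID register values for one leaf/subleaf (hypothetical helper type).
struct CpuidRegs { uint32_t eax, ebx, ecx, edx; };

// Query CPUID. Returns all zeros if the leaf is not supported, or if the
// compiler/architecture is not one of the two cases handled here.
inline CpuidRegs queryCpuid(uint32_t leaf, uint32_t subleaf = 0) {
    CpuidRegs r = {0, 0, 0, 0};
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    // Check the highest supported leaf in the same range (basic or extended).
    __cpuid(regs, static_cast<int>(leaf & 0x80000000u));
    if (static_cast<uint32_t>(regs[0]) >= leaf) {
        __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
        r.eax = static_cast<uint32_t>(regs[0]);
        r.ebx = static_cast<uint32_t>(regs[1]);
        r.ecx = static_cast<uint32_t>(regs[2]);
        r.edx = static_cast<uint32_t>(regs[3]);
    }
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
    // __get_cpuid_count performs the max-leaf check itself and leaves the
    // outputs untouched (zero here) when the leaf is unsupported.
    __get_cpuid_count(leaf, subleaf, &r.eax, &r.ebx, &r.ecx, &r.edx);
#endif
    return r;
}

// Leaf 1, ECX bit 9: SSSE3 (pabsd is an SSSE3 instruction).
inline bool hasSsse3(const CpuidRegs& leaf1) { return ((leaf1.ecx >> 9) & 1u) != 0; }

// Leaf 0x80000001, ECX bit 16: FMA4.
inline bool hasFma4(const CpuidRegs& ext1) { return ((ext1.ecx >> 16) & 1u) != 0; }

// Leaf 7 subleaf 0, EBX bit 16: AVX512F (foundation).
inline bool hasAvx512f(const CpuidRegs& leaf7) { return ((leaf7.ebx >> 16) & 1u) != 0; }
```

A caller would select a code path once at startup, for example `const bool useSsse3 = hasSsse3(queryCpuid(1));`, and fall back to an SSE2 path when the bit is clear.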
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        }
        else {
            count = cores;
        }
    }
    return count;
}
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
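The slide's code relies on helpers whose bodies are not shown. As a minimal sketch of what they might compute, the functions below decode the vendor string and family from raw CPUID register values and apply the same thread-count policy. The names `cpuidVendor`, `cpuidFamily`, and `defaultThreadCount` are hypothetical and are not taken from the GPUOpen sample. The decoding itself follows the architectural definition: the vendor string is stored in leaf 0 in EBX, EDX, ECX order, and the family is the base family (leaf 1 EAX bits 11:8) plus the extended family (bits 27:20) when the base family is 0xF.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Build the 12-character vendor string from CPUID leaf 0.
// Register order is EBX, EDX, ECX; each register holds 4 characters,
// lowest byte first.
inline std::string cpuidVendor(uint32_t ebx, uint32_t edx, uint32_t ecx) {
    const uint32_t regs[3] = {ebx, edx, ecx};
    std::string vendor;
    for (uint32_t reg : regs) {
        for (int byte = 0; byte < 4; ++byte) {
            vendor.push_back(static_cast<char>((reg >> (8 * byte)) & 0xFFu));
        }
    }
    return vendor;
}

// Decode the family from CPUID leaf 1 EAX.
inline uint32_t cpuidFamily(uint32_t eax) {
    const uint32_t base = (eax >> 8) & 0xFu;
    const uint32_t extended = (eax >> 20) & 0xFFu;
    return (base == 0xFu) ? base + extended : base;
}

// Same policy as the slide: on AMD processors other than family 15h
// ("Bulldozer"), default to the physical core count; otherwise use the
// logical processor count.
inline uint32_t defaultThreadCount(const std::string& vendor, uint32_t family,
                                   uint32_t cores, uint32_t logical) {
    if (vendor == "AuthenticAMD" && family != 0x15u) {
        return cores;
    }
    return logical;
}
```

Passing the topology in as parameters keeps the policy separate from the platform-specific enumeration, so it can be profiled and overridden per title as the slide recommends.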
Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64;                     // logical processors
DWORD_PTR p = 0xffffffffffffffff;   // process mask
int b = 32;                         // bit index

#if 0 // BAD
int x = p;                          // signed narrowing
int count = 0;
// while (x != 0) {                 // never zero
while (x > 0) {                     // false
    count += (x & 1);
    x >>= 1;                        // fill bits with sign
}
printf("popcnt = %i\n", count);                           // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));          // undefined 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));       // 0x0000000100000000
#endif
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
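The GOOD branch above can be restated portably with fixed-width unsigned types. This is a minimal sketch under the assumption that the affinity mask fits in 64 bits (one processor group); `countSetBits` and `bitMask` are hypothetical helper names, not Windows APIs.

```cpp
#include <cassert>
#include <cstdint>

// Count the set bits in a 64-bit affinity mask. The mask stays unsigned and
// 64 bits wide, so the right shift fills with zeros and the loop terminates.
inline int countSetBits(uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Build a single-bit mask for a logical processor index. The shift is done
// on a 64-bit unsigned value; an out-of-range index yields 0 instead of
// undefined behavior.
inline uint64_t bitMask(unsigned index) {
    return (index < 64) ? (uint64_t{1} << index) : 0;
}
```

With a full 64-processor mask, `countSetBits` reports 64 and `bitMask(32)` is `0x0000000100000000`, matching the value noted in the slide's GOOD branch.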
// char buffer[0x1000];             // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
            // ...
        }
    }
}
free(buffer);
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
  • Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 1921 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: CPU time difference from NumContexts 1 (-50 to 400). Series: Draws 1025, Draws 10250.]
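The NumContexts recommendation on this slide can be sketched as a one-line helper. This assumes the recommendation reads as min(cores - 1, Draws / 250), using integer division; the function name `recommendedNumContexts` and the clamp to a minimum of one context are assumptions added for illustration, not part of the slide or the Microsoft sample.

```cpp
#include <algorithm>
#include <cassert>

// Suggested number of command-list recording contexts, assuming the slide's
// recommendation is min(cores - 1, draws / 250). The result is clamped to at
// least 1 so that small scenes and single-core systems still get one context
// (the clamp is an assumption of this sketch).
inline unsigned recommendedNumContexts(unsigned cores, unsigned draws) {
    const unsigned byCores = (cores > 1) ? cores - 1 : 1;
    const unsigned byDraws = draws / 250;
    return std::max(1u, std::min(byCores, byDraws));
}
```

For the two draw counts charted on the slide and an 8-core processor, this yields 4 contexts for 1025 draws and 7 contexts for 10250 draws. Treat the result as a starting point and confirm it by profiling.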
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
USE UE4 PARALLEL RENDERING
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                   pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {  // lock cmpxchg
        }
    }
#else // GOOD
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
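The same best practices can be expressed with standard C++ atomics instead of MSVC intrinsics. This is a minimal sketch and not the code measured in the talk: the class name `SpinLock` and its `tryLock` method are hypothetical, and the release-store unlock is a design choice of this sketch (the slide's `Unlock` uses `_InterlockedExchange`). On non-x86 targets the sketch falls back to `std::this_thread::yield()` in place of the pause instruction.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// Test and test-and-set spinlock, aligned to a 64-byte cache line so the
// lock variable does not share a line with unrelated data.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load, which avoids issuing a locked
            // read-modify-write while the lock is held by another thread.
            while (taken_.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: attempt the atomic exchange only once the lock
            // looked free; a result of false means this thread acquired it.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }

    // Single acquisition attempt; returns true if the lock was acquired.
    bool tryLock() {
        return !taken_.load(std::memory_order_relaxed) &&
               !taken_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        taken_.store(false, std::memory_order_release);
    }

private:
    std::atomic<bool> taken_{false};
};

static_assert(alignof(SpinLock) == 64, "lock should occupy its own cache line");
```

A thread brackets its critical section with `lock()` and `unlock()`, just as the slide's thread procedure does with `MyLock::Lock` and `MyLock::Unlock`.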
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Use Best Practices With Spinlocks (less is better)
binary | Time (s) | diff
Before | 304 |
After | 27 | +1018%
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm>   // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
00000001400010D9:
    pause
    jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg

Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
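When auditing call sites, the length condition quoted above can be written as a small predicate. This is only a sketch: the helper name `in_reported_regression_range` is hypothetical, and it encodes just the length range from the slide, not the "unknown at compile time" condition, which depends on how the compiler sees the call.

```cpp
#include <cstddef>

// Hypothetical audit helper: true when a runtime copy or set length falls
// in the range the slide reports (length > 32 && length <= 128 bytes).
constexpr bool in_reported_regression_range(std::size_t length) {
    return length > 32 && length <= 128;
}
```

For example, the default size of 48 bytes used by the memcpy benchmark in this section falls inside the range, while a 160-byte copy does not.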
WORKAROUND
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]         ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                   ; is this a small set, size <= 128?
    ja      XmmSet                    ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: call memcpy benchmark (less is better). Elapsed Time (s) versus size, 0 to 160; series: before, after]
CODE SAMPLE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

// Time `steps` calls to memcpy with a length that is unknown at compile time.
void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (int i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    // Print one element of each buffer so the copy cannot be optimized away
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
calls memcpy within executable with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND)
calls vcruntime140.dll memcpy
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
  • >= 22 Per Thousand Instructions is bad for top functions
  • >= 35 Per Thousand Instructions is very bad for top functions
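The "one stream per hardware thread" advice can be illustrated with a loop that writes to a single non-temporal destination and fills each 64-byte line completely before moving on. This is a sketch only, assuming an x86 target with SSE, 64-byte aligned pointers, and a float count that is a multiple of 16. The function name `stream_copy` is hypothetical.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // _mm_load_ps, _mm_stream_ps, _mm_sfence (x86 SSE)

// Hypothetical helper: copy `count` floats through ONE non-temporal stream.
// Assumptions: dst and src are 64-byte aligned, count is a multiple of 16.
void stream_copy(float* dst, const float* src, std::size_t count) {
    for (std::size_t i = 0; i < count; i += 16) {
        // Four consecutive 16-byte stores cover one whole 64-byte line,
        // all going to the same destination stream.
        _mm_stream_ps(dst + i + 0,  _mm_load_ps(src + i + 0));
        _mm_stream_ps(dst + i + 4,  _mm_load_ps(src + i + 4));
        _mm_stream_ps(dst + i + 8,  _mm_load_ps(src + i + 8));
        _mm_stream_ps(dst + i + 12, _mm_load_ps(src + i + 12));
    }
    _mm_sfence();  // order the streamed stores before later stores
}
```

If a second output array is needed in the same loop, the sample on the following slides writes it with regular stores rather than opening a second non-temporal stream.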
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%)]
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
  • Minimize
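The "align and pad" advice can be checked at the type level. The sketch below assumes 64-byte cache lines, as used throughout these slides. The names `PackedCounter`, `PaddedCounter`, `kCacheLine`, and `same_cache_line` are hypothetical and only illustrate the difference between the 4-byte and 64-byte `ThreadData` variants in the sample that follows.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Unpadded: adjacent array elements can land in the same cache line.
struct PackedCounter { unsigned long sum; };

// Aligned and padded: each array element occupies its own cache line.
struct alignas(kCacheLine) PaddedCounter { unsigned long sum; };

// True when two objects start in the same 64-byte cache line.
inline bool same_cache_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

With per-thread slots declared as `PaddedCounter`, two threads updating neighbouring elements no longer write to the same line, which is the effect the 64-byte `ThreadData` in the sample relies on.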
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid False Sharing. Time (s) by binary: Before 365, After 47 (+679%)]
CODE SAMPLE
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: neighbouring elements share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: one cache line per element
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    // Each thread updates its own element of a[]
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
Columns shown: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI

PROFILING AFTER
Few refills from CCX
Columns shown: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180

BEFORE (BAD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen 2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU

KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | size | before | after | diff | |||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff | |||
gxMT Off | 42.68 | ||||
gxMT On | 58.19 | +36% | |||
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff | |||
DX9 | 53.8500972007 | ||||
DX11 | 58.9473747244 | +9% | |||
VK Beta | 60.1682313467 | +12% | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 1911 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34 | |||||
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff | |||
v3.99 non-SSE2 | 170 | ||||
v3.100 SSE2 | 82 | +107% | |||
Ryzen 7 2700X |
AMD UPROF PROFILER

V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• "Assess Performance (Extended)" event based sampling profile updated
• See https://developer.amd.com/amd-uprof/
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
  • See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors)
• Testing done by Kenneth Mitchell January 18, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync

Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Chart: World of Warcraft v8.1 DX12 (higher is better). Average FPS: gxMT Off, gxMT On (+36%)]
[Chart: Audacity LAME MP3 Encode x86 (lower is better). Elapsed Time (s): v3.99 non-SSE2 170, v3.100 SSE2 82 (+107%)]
[Chart: DOTA 2 version 3359 gameplay 7.21B (higher is better). Average FPS: DX9, DX11 (+9%), VK Beta (+12%)]
[Chart: Cache Latency (less is better), in core clock cycles. AMD Ryzen 7 1800X: L1D 4, L2U 17, L3U 40. AMD Ryzen 7 2700X: L1D 4, L2U 12, L3U 35]
[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) by binary: Before, After (+1018%)]
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before, After (+123%)]
[Chart: Avoid False Sharing (less is better). Elapsed Time (s) by binary: Before, After (+679%)]
[Chart: LAME 3.100 WAV to MP3 Encode (less is better). Elapsed Time (s) by configuration and platform: Release x86 52, ReleaseSSE2 x86 44 (+18%), Release x64 43 (+21%), ReleaseSSE2 x64 39 (+33%)]
[Chart: call memcpy benchmark (less is better). Series before and after, sizes 1 to 160]
4228150000000124432470200000001255233129000000012634784949999999926597249000000001268757642659713950000000126877771127147433799999998274286139000000022769850470000000127691927929038968199999999285074741287745597000000012906889659999999929342252100000003247167953999999995431009200000000154289093999999993542807980000000085429775199999999954285352999999992543102350000000025429181899999999754333602000000001543091070000000015429707000000000554284804000000006542917900000000045428211954333508000000004543119800000000025297111700000000347239383999999998472303219999999964722281299999999747233704999999997472265234730229599999999547250046000000001472470310000000024724266199999999747222689999999998472230449999999994739514500000000347228518999999993472527700000000024724615200000000651613316999999999after12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916024420801000000001244276609999999962442141224421441000000002244220692442934000000000224428771999999999244229889999999992442025799999999724426190000000001244231590000000012442771599999999924433330999999998244297150000000012443059600000000224424807999999998162804659999999981628582699999999916282533000000001162938671628145516293344999999999162867261628223199999999916280336162851370000000011628311316281346000000001162878631629122800000000216286962162824640000000012171489000000000221720438217366459999999992174692721729122000000003217331440000000022172436799999999821719466999999999217144682172064499999999821716383000000001217179499999999992172886321739630000000001217151609999999983528425299999999929857238000000001298636090000000022985926199999999829868822000000002298659080000000012987893799999999729868054000000002298618032
985607700000000129862991999999999298593159999999982988368529879155000000002298705580000000032986950499999999840719522000000001352857180000000043529249899999999935298299000000002352993993530124299999999835312941999999996352980929999999973529456200000000335284358352917619999999983528649600000000135314313531168999999999835307901000000004352970510000000024614668800000000540712150000000005407208720000000044073206599999999840758292999999997407340940000000014074793200000000240727935999999998407207224071429600000000140740531000000004407158840000000044073346299999999940717015999999999407114879999999964072123399999999751571047999999999462005199999999944616197900000000446163672461684029999999984615091399999999846144597000000003461538140000000044614714799999999846196372000000006461795439999999954618862461417500000000044614947400000000146154512000000008461738370000000045705400351607314999999998516228359999999945160178499999999851588288000000002516037595158359400000000251629823999999997516063755158430199999999751578151999999999515817969999999985157392300000000651583717516256990000000035162398699999999862459967999999995624427409999999936242884300000000162443779000000008624320079999999946249732500000000462479226249354599999999362469498999999997624440259999999996242921800000000462450745999999997624320990000000016247293299999999962477679999999998624792300000000014752019599999999741804055417956679999999994180532100000000641801297000000002418242379999999964182397100000000241822955000000004418170779999999994179679941807789418292884183701799999999741820086000000005418053920000000014179696100000000146151128999999997
size
Elapsed Time (s)
call memcpy benchmark
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182838485868788899091929394959697989910010110210310410510610710810911011111211311411511611711811912012112212312412512612712812913013113213313413513613713813914014114214314414514614714814915015115215315415515615715815916002222912344275684802217887746190683202256260612613225602232334693108404302227336267046007302223417005944492402227843462618588302225329995439953022242250675648090221893426686683440222312396197396070221914689036011302231863514639080802224407448060690102228502734849364702225399274377102303333934053239018303328863188832842403334373712001688103363667446162412803338498309886923603327990047470301703338263319466416903335101108988007803334590883136565603328633342169613503333422177933667033334694809630605033344140971716163033311657046356435033383764264937810333662214760615866753326219934798872482990444299507728658220776103257374840270535695974641864958924776820208341692293762213262084310197751909795944808281243318510036708856760570415841738503611443949942882505146203946502837590844030817298671434610432651685768559579180668500448962317552856188456447311033965479890257068387704726295646556864532676916418055658738291164628196643751795997569267462996082704274691116169375305316914334994513846672741759432544484722033287701566817451935161276129172956708113064099736320004467274997548653819338486748032732555161131600253173252702336231562633181628363080591503856886383345138539043663052240964999449637764239524421496460299030885322865408218127200435662282187478088696617195168662875470005440041425473684150940511084566918356980365426369969451879906417078570274893504441798422023266326513586963596862315332496391531104354007356268154929546137903763535975401822095269873954629073200573703553250697506497785600128013447306256694571606985411566300868783472576001811848172079258659318506752056593553439672494686000192058811508760677712
026113948368469037123310014465816445356027464545948084244744703169258286039647620196002881023470681131038921574766493464012699482352719539061254883162357075675349339497049681729493817569961279952311078789537335512378553045777355179055648608259552403667273093468528977000546868143282341100588507939456773521350605410391360521146184154327316390315242096638678918747415410892838252374210527033071794142580914860372179431496620330337164369560821042029843689259553153432462969300972589264527497704605483645782253341689199463070276297856954683840750231089729572219121213768-013024170415581204-013038442823616014-013072528810275874-013028983466301447-013139717579912435-013074723083930939-013124118448967514-013023791018397635-013027537654282573-013026189115487552-013075811776531843-013038659808634656-013110980398503136-013035330377184295-013071944068452823011470746038168711013001918115359845013002912167835179012958857557869230129957881450424750129166369988617640130985290708048650129763451673847510129850129652770101302937815883937801295189516001431501289468995981952701328518920731875901293262046376468801303032393524739013037289959908827011835437438594409
size
B difference from A
D3D12Multithreading
TitleThrottle=10000
(less is better)
Draws102512345678910111213141516003621621621621622405089820359281438405774647887323944904534967555875992803448965977318212401858823529411763979807177289769715E-284454007530930575E-235439137134052334E-2-32165146423427826E-2-7225034514496087E-2-010040160642570273-020410580339518369-020190023752969122-02398190045248868Draws102501234567891011121314151600896251419916698211657703874049177222484324324324323273267080745341583089275993467610233814523184601928337762237762237752806435267291613728982874935132328300213106020245052966209081309397529672564034856088297145122918318762953684210526315530376242945444769
NumContexts
cpu time difference from NumContexts1
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event-based sampling.
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
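Most of the counters above are reported "PTI", which later slides spell out as per thousand instructions. As a minimal sketch of that normalization (the function names are hypothetical, and the slides do not show how uProf itself derives its columns), raw event counts can be turned into PTI and IPC values like this:

```cpp
#include <cstdint>

// Events per thousand retired instructions (PTI).
// Returns 0 when no instructions were retired, to avoid dividing by zero.
double perThousandInstructions(std::uint64_t events,
                               std::uint64_t retiredInstructions) {
    if (retiredInstructions == 0) return 0.0;
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retiredInstructions);
}

// Instructions per cycle (IPC) from retired instructions and CPU clocks.
double instructionsPerCycle(std::uint64_t retiredInstructions,
                            std::uint64_t cpuClocks) {
    if (cpuClocks == 0) return 0.0;
    return static_cast<double>(retiredInstructions) /
           static_cast<double>(cpuClocks);
}
```

Normalizing by instruction count makes functions of different sizes comparable, which is why the thresholds quoted later in the deck are stated per thousand instructions.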
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Diagram labels: rdpmc, SMN, in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing.
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
• Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED

AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE

USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors.

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops. Support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes

LAME 3.100 WAV to MP3 Encode (less is better)
Configuration, Platform | Elapsed Time (s) | Change vs. Release x86
Release, x86 | 52 | baseline
ReleaseSSE2, x86 | 44 | +18%
Release, x64 | 43 | +21%
ReleaseSSE2, x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
• ISPC: https://github.com/ispc/ispc
  • x86-x64 issue fixed November 21, 2018
  • CPU_Pentium4 issue fixed July 12, 2018
• Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
  • pabsd issue fixed November 2nd, 2018
• Bullet3: https://github.com/bulletphysics/bullet3
  • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

Example debugger output when an unsupported instruction is executed:

    0:000> g
    (79c838): Illegal instruction - code c000001d (first chance)
    (79c838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
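The slide gives the rule but not the check itself. The following is a minimal sketch of how a feature bit can be tested before choosing a code path, assuming the MSVC `__cpuid` intrinsic or the GCC/Clang `__get_cpuid` helper is available. The function names `cpuidLeaf`, `hasSse2`, `hasSsse3`, and `hasFma4` are hypothetical and do not come from the presentation. AVX-family instructions additionally need an operating system check through XGETBV, which this sketch does not cover.

```cpp
#include <array>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

// Returns {EAX, EBX, ECX, EDX} for a CPUID leaf (sub-leaf 0),
// or all zeros when the leaf or the instruction is unavailable.
static std::array<std::uint32_t, 4> cpuidLeaf(std::uint32_t leaf) {
    std::array<std::uint32_t, 4> r{};
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int maxLeaf[4];
    __cpuid(maxLeaf, static_cast<int>(leaf & 0x80000000u));
    if (static_cast<std::uint32_t>(maxLeaf[0]) >= leaf) {
        int regs[4];
        __cpuid(regs, static_cast<int>(leaf));
        for (int i = 0; i < 4; ++i) r[i] = static_cast<std::uint32_t>(regs[i]);
    }
#elif defined(__x86_64__) || defined(__i386__)
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(leaf, &a, &b, &c, &d)) r = {a, b, c, d};
#endif
    return r;
}

// Leaf 1, EDX bit 26: SSE2.
bool hasSse2()  { return ((cpuidLeaf(1)[3] >> 26) & 1u) != 0; }
// Leaf 1, ECX bit 9: SSSE3 (required by pabsd).
bool hasSsse3() { return ((cpuidLeaf(1)[2] >> 9) & 1u) != 0; }
// Leaf 0x80000001, ECX bit 16: FMA4.
bool hasFma4()  { return ((cpuidLeaf(0x80000001u)[2] >> 16) & 1u) != 0; }
```

A caller would typically evaluate these once at startup and select a function pointer or code path, for example using the SSSE3 path only when `hasSsse3()` returns true and falling back to SSE2 otherwise.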
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
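The slide code relies on helpers such as getCpuidFamily() whose bodies are not shown. As a hedged sketch of the same policy with its inputs passed in explicitly, the example below decodes the CPUID display family from the EAX value of leaf 1 and then applies the thread-count rule. The names `decodeFamily` and `defaultThreadCount` are hypothetical; obtaining the core counts and the raw CPUID values is left to the caller.

```cpp
#include <cstdint>
#include <cstring>

// Decode the display family from CPUID leaf 1 EAX:
// base family is bits 11:8; when it is 0xF, the extended family
// (bits 27:20) is added to it.
std::uint32_t decodeFamily(std::uint32_t eax) {
    const std::uint32_t base = (eax >> 8) & 0xFu;
    const std::uint32_t ext = (eax >> 20) & 0xFFu;
    return (base == 0xFu) ? base + ext : base;
}

// Same policy as the slide: on AMD processors other than family 15h,
// default to one thread per physical core; otherwise use logical processors.
std::uint32_t defaultThreadCount(std::uint32_t cores, std::uint32_t logical,
                                 const char* vendor, std::uint32_t family) {
    if (std::strcmp(vendor, "AuthenticAMD") != 0) return logical;
    return (family == 0x15u) ? logical : cores;
}
```

For example, an EAX value of 0x00800F82 (base family 0xF, extended family 0x8) decodes to family 17h, so a part reporting 8 cores and 16 logical processors would default to 8 threads. As the slide says, the result is a starting point that should be confirmed by profiling.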
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • "Number of cores per processor" & "Number of threads per processor" return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:

    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:

    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index

    #if 0 // BAD
    int x = p;                          // signed narrowing
    int count = 0;
    // while (x != 0)                   // never zero
    while (x > 0) {                     // false
        count += (x & 1);
        x >>= 1;                        // fill bits with sign
    }
    printf("popcnt = %i\n", count);     // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
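The effect of narrowing can be reproduced without any Windows types. The sketch below is an illustration under the assumption of a 64-bit affinity mask held in `std::uint64_t`; the function names are hypothetical. It contrasts a count taken after narrowing to `int` with a count taken on the full-width unsigned value.

```cpp
#include <cstdint>

// BAD: narrows the 64-bit mask to a signed 32-bit int first.
// With all bits set, x becomes -1 on mainstream compilers,
// so the loop condition (x > 0) is false immediately.
int countSetBitsNarrowed(std::uint64_t mask) {
    int x = static_cast<int>(mask);
    int count = 0;
    while (x > 0) {
        count += (x & 1);
        x >>= 1;
    }
    return count;
}

// GOOD: keeps the mask unsigned and 64 bits wide.
int countSetBits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// GOOD: builds a single-bit mask with a 64-bit unsigned literal.
// The caller must pass bit < 64.
std::uint64_t bitMask(unsigned bit) { return 1ULL << bit; }
```

With a full mask of 64 logical processors, the narrowed version reports 0 set bits while the full-width version reports 64, which matches the comments in the slide code.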
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"

    // char buffer[0x1000];   // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
• See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1, Feb 4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: D3D12Multithreading, TitleThrottle=10000. CPU time % difference from NumContexts 1 versus NumContexts (0 to 16). Series: Draws 1025, Draws 10250. Higher is better.]
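The recommendation on this slide reads as NumContexts = min(cores-1, Draws/250), taking the stripped operator to be a division. A minimal sketch of that heuristic follows. The function name is hypothetical, integer division is assumed, and the clamp to at least one context is an added assumption that the slide does not state.

```cpp
#include <algorithm>

// Suggested number of command-list recording contexts:
// leave one core for the main thread, and require roughly 250 draws
// per context before adding another one.
int recommendedNumContexts(int physicalCores, int draws) {
    const int byCores = physicalCores - 1;
    const int byDraws = draws / 250;  // integer division (assumption)
    // Assumption: never return fewer than one context.
    return std::max(1, std::min(byCores, byDraws));
}
```

With the 8-core test system described above, 1025 draws would give 4 contexts and 10250 draws would give 7. The heuristic is only a starting point; the measured curve for the actual title should decide the final value.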
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
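The slide code depends on MSVC intrinsics. As a hedged, portable sketch of the same test and test-and-set pattern, the example below uses `std::atomic` instead; this is a substitution for illustration, not the presenter's code. The names `SpinLock` and `cpuRelax` are hypothetical, and the exact instructions the compiler emits for `exchange` are an assumption that should be checked in the disassembly.

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#endif

// Hint to the core that this is a spin-wait loop (pause on x86, no-op elsewhere).
inline void cpuRelax() {
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    _mm_pause();
#endif
}

// Test and test-and-set spin lock, padded to its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is taken.
            while (state.load(std::memory_order_relaxed) == 1u) cpuRelax();
            // Test-and-set: attempt to take the lock with an exchange.
            if (state.exchange(1u, std::memory_order_acquire) == 0u) return;
            cpuRelax();
        }
    }

    bool try_lock() {
        return state.load(std::memory_order_relaxed) == 0u &&
               state.exchange(1u, std::memory_order_acquire) == 0u;
    }

    // Note: the slide's Unlock uses an exchange; a release store is used here.
    void unlock() { state.store(0u, std::memory_order_release); }
};
```

Because the type provides `lock`, `try_lock`, and `unlock`, it can be used with `std::lock_guard<SpinLock>`. The read-only spin keeps waiting threads from repeatedly taking the cache line for writing, which is the point of the test step.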
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Use Best Practices With Spinlocks (less is better)
Binary | Elapsed Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    // Each thread holds the lock for its whole workload, so the other
    // threads spend their time spinning in MyLock::Lock.
    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET

BEFORE (BAD)

    00000001400010C1:
        xor eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp eax,ecx
        je 00000001400010C1

AFTER (GOOD)

    00000001400010C0:
        cmp dword ptr [?gLock@@3IA],1
        je 00000001400010D9
        mov eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp eax,1
        jne 00000001400010DD
    00000001400010D9:
        pause
        jmp 00000001400010C0
    00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg

Preferred: instructions without lock prefix (the last two rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time
• Meanwhile, we propose a workaround

WORKAROUND
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456:

    XmmCopySmall:
        bt __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc memcpy_repmovs
        movups xmm0, [rcx + rdx]     ; load deferred bytes
        add rcx, 16
        sub r8, 16

memset.asm line 93:

        ; Check if strings should be used
        cmp r8, 128                  ; is this a small set, size <= 128?
        ja XmmSet                    ; if large set, use XMM set
        bt __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc XmmSetSmall              ; otherwise, use a 16-byte block set
        jmp memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: call memcpy benchmark. Elapsed Time (s) versus size (0 to 160). Series: before, after. Less is better.]
CODE SAMPLE

    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand, EXIT_SUCCESS
    #include <cstring>   // memcpy, memset
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Copy `size` bytes `steps` times; size is only known at run time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide uses i <= sizeof(a), which writes one byte past the array.
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));

        work(size, steps);

        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy

    memcpy:
        jmp qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND): calls memcpy within executable with workaround

    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb 00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
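One way to follow the "one stream per hardware thread" advice is to process the arrays one after another, so that only a single non-temporal destination is active at any time. The sketch below is an interpretation of that bullet and not the workaround used in the talk, which replaces one streaming store with a regular store. The function names are hypothetical, the data layout {x,y,z,w,vx,vy,vz,vw} is borrowed from the talk's sample, and SSE on an x86 target is assumed.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence

// position += velocity * dt for records laid out as {x,y,z,w,vx,vy,vz,vw}.
// data must be 16-byte aligned and count must be a multiple of 8.
static void integrateStreamed(float* data, std::size_t count, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < count; i += 8) {
        const __m128 p = _mm_load_ps(data + i);
        const __m128 v = _mm_load_ps(data + i + 4);
        // Non-temporal store: the only write stream inside this loop.
        _mm_stream_ps(data + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
}

// Update two arrays in separate passes instead of interleaving their stores.
void stepOneStreamAtATime(float* a, float* b, std::size_t count, float dt) {
    integrateStreamed(a, count, dt);  // stream to a only
    integrateStreamed(b, count, dt);  // then stream to b only
    _mm_sfence();                     // order the streaming stores before later accesses
}
```

Whether splitting the loop helps in a given workload is not something the slides establish, so the WcbFull PTI counter described above should be compared before and after such a change.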
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Avoid Too Many Non-Temporal Streams (less is better)
Binary | Elapsed Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE

    #include <intrin.h>
    #include <stdio.h>    // printf
    #include <stdlib.h>   // atoi, atoll, atof, rand, srand
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
mulps   xmm0,xmm1
addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps   xmm0,xmm1
addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
mulps   xmm0,xmm1
addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps   xmm0,xmm1
addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that reside in the same cache line. This can reduce performance because of the work the processors must do to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread-local rather than global or process-shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 “Assess Performance (Extended)” Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
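The "align and pad" advice above can be sketched portably with std::thread. This is a minimal illustration rather than the presenter's sample: `PaddedCounter`, `kCacheLine`, and `count_in_parallel` are hypothetical names, and the 64-byte line size is an assumption chosen to match the `alignas(64)` used in the deck's own code sample.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumed cache line size; 64 bytes matches the alignas(64) in the deck's sample.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot. alignas rounds sizeof up to a full line,
// so neighbouring array elements never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Each thread increments only its own slot; returns the sum over all slots.
std::uint64_t count_in_parallel(unsigned num_threads, std::uint64_t iters) {
    // Over-aligned element types need C++17 for correctly aligned allocation.
    std::vector<PaddedCounter> slots(num_threads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&slots, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                slots[t].value += 1;
            }
        });
    }
    for (auto& w : workers) w.join();

    std::uint64_t total = 0;
    for (const auto& s : slots) total += s.value;
    return total;
}
```

Removing `alignas(kCacheLine)` (and the `static_assert`) packs the counters eight to a line, which is the layout the summary warns against; the result stays the same, only the cache traffic changes.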
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: Avoid False Sharing. Time (s) per binary: Before 365, After 47 (+679%).
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, srand, rand, EXIT_SUCCESS
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: adjacent ThreadData elements share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: each ThreadData element gets its own cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++) {
        p->sum += rand() % 2;
    }
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i) {
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    }
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
Columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX
Columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
mov  qword ptr [rsp+28h],rsi
lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov  r9,rbp
mov  dword ptr [rsp+20h],esi
xor  edx,edx
xor  ecx,ecx
call qword ptr [__imp_CreateThread]
mov  qword ptr [r15+rdi*8],rax
add  rbp,40h
inc  rdi
cmp  rdi,r14
jb   0000000140001180
BEFORE (BAD)
0000000140001180
mov  qword ptr [rsp+28h],rsi
lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov  r9,rbp
mov  dword ptr [rsp+20h],esi
xor  edx,edx
xor  ecx,ecx
call qword ptr [__imp_CreateThread]
mov  qword ptr [r12+rdi*8],rax
add  rbp,4
inc  rdi
cmp  rdi,r14
jb   0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
V2.0 NEW FEATURES
• Remote profiling
• Thread Concurrency
• Designed to support multiple counters using the same event but different unit masks
• “Assess Performance (Extended)” event based sampling profile updated
• See https://developer.amd.com/amd-uprof
• Open-Source Register Reference For AMD Family 17h Processors (Publication 56255)
• See http://support.amd.com/en-us/search/tech-docs
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (aka logical processors).
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
Embedded chart data (titles and axes only; the values appear in the Excel tables above):
• World of Warcraft: gxMT Off vs. gxMT On; axis: Average FPS
• Audacity/LAME MP3 Encode x86 (lower is better): non-SSE2 vs. SSE2; axis: Elapsed Time (s)
• DOTA 2 (higher is better): DX9, DX11, VK Beta; axis: Average FPS
• Cache Latency (less is better): AMD Ryzen 7 1800X vs. AMD Ryzen 7 2700X for L1D, L2U, L3U; axis: core clock cycles
• Use Best Practices With Spinlocks (less is better): Before vs. After binary; axis: Elapsed Time (s)
• Avoid Too Many Non-Temporal Streams (less is better): Before vs. After binary; axis: Elapsed Time (s)
• Avoid False Sharing (less is better): Before vs. After binary; axis: Elapsed Time (s)
• LAME 3.100 WAV to MP3 Encode (less is better): Release x86, ReleaseSSE2 x86, Release x64, ReleaseSSE2 x64; axes: Configuration Platform, Elapsed Time (s)
• call memcpy benchmark (less is better): before vs. after; axes: size (1 to 160), Elapsed Time (s)
• call memcpy benchmark: axes: size, B difference from A
• D3D12Multithreading, TitleThrottle=10000 (less is better): Draws1025, Draws10250; axes: NumContexts (1 to 16), cpu time difference from NumContexts1
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
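The "PTI" suffix on these metrics means "per thousand instructions", which is how the profiling slides quote WcbFull ("per 1000 instructions"). Assuming each PTI metric is the raw event count normalized by retired instructions, and IPC is retired instructions divided by CPU clocks, the arithmetic can be sketched as below. The helper names `per_thousand_instructions` and `ipc` are hypothetical and are not part of AMD uProf.

```cpp
#include <cstdint>

// Events per thousand retired instructions (PTI).
// Returns 0.0 when no instructions were retired, to avoid dividing by zero.
double per_thousand_instructions(std::uint64_t event_count,
                                 std::uint64_t retired_instructions) {
    if (retired_instructions == 0) return 0.0;
    return 1000.0 * static_cast<double>(event_count) /
           static_cast<double>(retired_instructions);
}

// Instructions per cycle: retired instructions divided by CPU clocks.
// Returns 0.0 when no clocks were counted.
double ipc(std::uint64_t retired_instructions, std::uint64_t cpu_clocks) {
    if (cpu_clocks == 0) return 0.0;
    return static_cast<double>(retired_instructions) /
           static_cast<double>(cpu_clocks);
}
```

Normalizing by instructions rather than by time makes a before/after comparison meaningful even when the two binaries run for different durations.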
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access paths shown in the diagram: rdpmc, SMN, in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • "Zen" Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
• Set core clock > base clock to disable boost on "Raven Ridge" processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | "Pinnacle Ridge"
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | "Summit Ridge"
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | "Kaveri", "Godavari"
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | "Vishera"
2012 | Added autovectorization. Optimized container memory sizes. | "Bulldozer"
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | "Greyhound"
2005 | Added x64 native compiler. | "K8"
COMPILE FOR PLATFORM X64
• Binary compiled after applying Platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file 448MB 444 minutes
Chart: LAME 3.100 WAV to MP3 Encode, elapsed time in seconds (less is better)
Configuration, Platform | Seconds | Difference from Release x86
Release, x86 | 52 |
ReleaseSSE2, x86 | 44 | +18%
Release, x64 | 43 | +21%
ReleaseSSE2, x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

0:000> g
(79c.838): Illegal instruction - code c000001d (first chance)
(79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
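The crash shown on this slide comes from executing pabsd, an SSSE3 instruction, on a processor that does not enumerate SSSE3. A minimal sketch of a runtime check before dispatching to such a code path could look like the following. The bit positions are the standard CPUID function 1 ECX feature flags; the names kSsse3, readCpuidLeaf1Ecx, AbsPath, and selectAbsPath are hypothetical and only illustrate the pattern.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define HAVE_X86_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && \
      (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define HAVE_X86_CPUID 1
#else
#define HAVE_X86_CPUID 0
#endif

// Feature flags reported in ECX by CPUID function 1.
constexpr std::uint32_t kSse3  = 1u << 0;
constexpr std::uint32_t kSsse3 = 1u << 9;   // required by pabsd
constexpr std::uint32_t kSse41 = 1u << 19;

constexpr bool hasFeature(std::uint32_t ecx, std::uint32_t bit) {
    return (ecx & bit) != 0;
}

// Reads CPUID function 1 ECX. Returns 0 (no optional features) when
// CPUID is not available in this build.
inline std::uint32_t readCpuidLeaf1Ecx() {
#if HAVE_X86_CPUID && defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);
    return static_cast<std::uint32_t>(regs[2]);
#elif HAVE_X86_CPUID
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return 0;
    return ecx;
#else
    return 0;
#endif
}

// Hypothetical dispatch: pick the SSSE3 path only when it is enumerated.
enum class AbsPath { Scalar, Ssse3 };

inline AbsPath selectAbsPath(std::uint32_t leaf1Ecx) {
    return hasFeature(leaf1Ecx, kSsse3) ? AbsPath::Ssse3 : AbsPath::Scalar;
}
```

A caller would evaluate `selectAbsPath(readCpuidLeaf1Ecx())` once at startup and cache the result. AVX-family paths need more than a CPUID bit: the operating system must also have enabled the extended register state, which is checked through OSXSAVE and _xgetbv.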
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
  • Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

// This advice is specific to AMD processors and is not
// general guidance for all processor vendors.
// getProcessorCount, getCpuidVendor, and getCpuidFamily are helper
// functions that are not shown on the slide.
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        }
        else {
            count = cores;
        }
    }
    return count;
}
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: for signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

DWORD lps = 64;                      // logical processors
DWORD_PTR p = 0xffffffffffffffff;    // process mask
int b = 32;                          // bit index
#if 0 // BAD
    int x = p;                       // signed narrowing
    int count = 0;
    // while (x != 0) {              // never zero
    while (x > 0) {                  // false
        count += (x & 1);
        x >>= 1;                     // fill bits with sign
    }
    printf("popcnt = %i\n", count);  // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
#else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

// char buffer[0x1000];  // bad assumption
char* buffer = NULL;
DWORD len = 0;
// The first call fails with ERROR_INSUFFICIENT_BUFFER and reports the required length.
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll,
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                &len)) {
            // ...
        }
        free(buffer);
    }
}
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: D3D12Multithreading, TitleThrottle=10000. CPU time difference from NumContexts = 1 (%) versus NumContexts (0 to 16), for Draws 1025 and Draws 10250 (higher is better)
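The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small helper. This is a hedged illustration: the function name recommendedNumContexts is hypothetical, and the lower clamp to one context is an added assumption so that small scenes or single-core systems still get a valid value.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper implementing NumContexts = min(cores - 1, Draws / 250).
// The clamp to at least 1 is an assumption, not part of the slide's formula.
inline int recommendedNumContexts(int physicalCores, int draws) {
    assert(physicalCores >= 1 && draws >= 0);
    const int byCores = physicalCores - 1;  // leave one core for the main thread
    const int byDraws = draws / 250;        // roughly 250 draws per context
    return std::max(1, std::min(byCores, byDraws));
}
```

For the 8-core test system this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.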
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithreadenable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            // lock cmpxchg on every iteration
        }
    }
#else // GOOD
    void Lock(PLOCK pl) {
        // Test with a plain read first; only attempt the exchange when the lock looks free.
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
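The same test and test-and-set pattern can be written with standard C++ atomics instead of MSVC intrinsics. The following is a minimal sketch under stated assumptions, not the presenter's code: the class name SpinLock, the helper cpuRelax, and the tryLock method are hypothetical, and unlock uses a release store rather than the exchange used on the slide.

```cpp
#include <atomic>
#include <cassert>

#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
static inline void cpuRelax() { _mm_pause(); }  // pause instruction
#else
static inline void cpuRelax() {}                // no-op on other targets
#endif

// Hypothetical portable spin lock. alignas(64) gives the lock its own
// cache line, as recommended in the summary above.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is held.
            while (state_.load(std::memory_order_relaxed) != 0) {
                cpuRelax();
            }
            // Test-and-set: attempt the exchange only when the lock looked free.
            if (state_.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
        }
    }

    // Single non-blocking attempt; returns true if the lock was acquired.
    bool tryLock() {
        return state_.load(std::memory_order_relaxed) == 0 &&
               state_.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() {
        assert(state_.load(std::memory_order_relaxed) == 1 && "unlock without lock");
        state_.store(0, std::memory_order_release);
    }

private:
    std::atomic<unsigned> state_{0};
};

static_assert(alignof(SpinLock) == 64, "lock should occupy its own cache line");
```

Spinning on the relaxed load keeps waiting threads reading a shared cache line instead of repeatedly issuing atomic read-modify-write operations, which is the behavior the GOOD variant on the slide is after.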
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Use Best Practices With Spinlocks, elapsed time in seconds by binary (less is better)
Binary | Time (s)
Before | 304
After | 27
Difference: +1018%
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm>
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        // Batch of 4x4 matrix multiply-accumulates: a[m] += b[m] * c[m]
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
BEFORE (BAD)
00000001400010C1:
  xor eax,eax
  lock cmpxchg dword ptr [?gLock@@3IA],ecx
  cmp eax,ecx
  je 00000001400010C1
AFTER (GOOD)
00000001400010C0:
  cmp dword ptr [?gLock@@3IA],1
  je 00000001400010D9
  mov eax,1
  xchg eax,dword ptr [?gLock@@3IA]
  cmp eax,1
  jne 00000001400010DD
00000001400010D9:
  pause
  jmp 00000001400010C0
00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS
Intrinsic | Generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix (xchg)
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
  • Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456
XmmCopySmall:
    bt __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc memcpy_repmovs
    movups xmm0, [rcx + rdx]     ; load deferred bytes
    add rcx, 16
    sub r8, 16

memset.asm line 93
    ; Check if strings should be used
    cmp r8, 128                  ; is this a small set, size <= 128?
    ja XmmSet                    ; if large set, use XMM set
    bt __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc XmmSetSmall              ; otherwise, use a 16-byte block set
    jmp memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: call memcpy benchmark, elapsed time in seconds versus size (0 to 160), before and after (less is better)
CODE SAMPLE
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

// size is a runtime value, so the compiler cannot inline a fixed-length copy.
void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (int i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
memcpy:
  jmp qword ptr [__imp_memcpy]
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
  mov r11,rcx
  mov r10,rdx
  cmp r8,10h
  jbe 00000001400012F0
  cmp r8,20h
  jbe 00000001400012D0
  sub rdx,rcx
  jae 00000001400012A4
  lea rax,[r8+r10]
  cmp rcx,rax
  jb 00000001400015D0
  cmp r8,80h
  jbe 0000000140001510
  …
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Avoid Too Many Non-Temporal Streams, elapsed time in seconds by binary (less is better)
Binary | Time (s)
Before | 104
After | 47
Difference: +123%
CODE SAMPLE
#include <intrin.h>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // layout per element: x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND)
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+4640h]
  movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AFTER (WITH WORKAROUND)
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps xmm0,xmm1
  addps xmm0,xmmword ptr [rsi+r8*4+4640h]
  movups xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Avoid False Sharing, time in seconds by binary
Binary | Time (s)
Before | 365
After | 47
Difference: +679%
CODE SAMPLE
#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: adjacent array elements share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: each array element gets its own cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
            (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
Profile columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
BEFORE (BAD)
0000000140001180:
  mov qword ptr [rsp+28h],rsi
  lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov r9,rbp
  mov dword ptr [rsp+20h],esi
  xor edx,edx
  xor ecx,ecx
  call qword ptr [__imp_CreateThread]
  mov qword ptr [r12+rdi*8],rax
  add rbp,4
  inc rdi
  cmp rdi,r14
  jb 0000000140001180
AFTER (GOOD)
0000000140001180:
  mov qword ptr [rsp+28h],rsi
  lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov r9,rbp
  mov dword ptr [rsp+20h],esi
  xor edx,edx
  xor ecx,ecx
  call qword ptr [__imp_CreateThread]
  mov qword ptr [r15+rdi*8],rax
  add rbp,40h
  inc rdi
  cmp rdi,r14
  jb 0000000140001180
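The only difference between the two listings is the per-thread stride: add rbp,4 versus add rbp,40h (64 bytes). A minimal sketch of why that matters follows, assuming a 64-byte cache line. The names PackedData, PaddedData, and elementsPerCacheLine are hypothetical, and unsigned int stands in for the 4-byte unsigned long of the Windows sample.

```cpp
#include <cassert>
#include <cstddef>

// 4-byte element: neighbouring array entries land in the same cache line.
struct PackedData { unsigned int sum; };

// 64-byte element: alignas pads and aligns each entry to its own cache line.
struct alignas(64) PaddedData { unsigned int sum; };

// Number of consecutive array elements that fit in one 64-byte cache line
// (the line size is an assumption of this sketch).
template <typename T>
constexpr std::size_t elementsPerCacheLine() {
    return sizeof(T) >= 64 ? 1 : 64 / sizeof(T);
}

static_assert(sizeof(PackedData) == 4, "stride of 4 bytes, as in the BAD listing");
static_assert(sizeof(PaddedData) == 0x40, "stride of 40h bytes, as in the GOOD listing");
static_assert(alignof(PaddedData) == 64, "each element starts on a cache-line boundary");
```

With the packed layout, up to 16 threads write to sums inside one cache line, so every update forces that line to move between cores. With the padded layout each thread owns its line.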
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
Average FPS | 2017 The International | | | 2015 The Frankfurt Major | |
Run | DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan
1 | 48.6 | 54.0 | 53.2 | 52.1 | 58.9 | 59.0
2 | 51.8 | 54.3 | 54.7 | 53.8 | 58.8 | 60.2
3 | 51.7 | 54.5 | 54.9 | 53.9 | 58.8 | 60.0
4 | 51.8 | 54.4 | 55.1 | 53.9 | 58.9 | 60.2
average last 3 | 51.8 | 54.4 | 54.9 | 53.9 | 58.9 | 60.1
diff vs DX9 (%) | | 5 | 6 | | 9 | 12
NumContexts | Draws1025 cpu time | Draws10250 cpu time | NumContexts | Draws1025 diff (%) | Draws10250 diff (%)
1 | 0.2016 | 1.5024 | 1 | 0 | 0
2 | 0.148 | 0.7923 | 2 | 36 | 90
3 | 0.1336 | 0.5653 | 3 | 51 | 166
4 | 0.1278 | 0.4625 | 4 | 58 | 225
5 | 0.1387 | 0.4025 | 5 | 45 | 273
6 | 0.1499 | 0.3674 | 6 | 34 | 309
7 | 0.17 | 0.3429 | 7 | 19 | 338
8 | 0.1867 | 0.3432 | 8 | 8 | 338
9 | 0.1859 | 0.3947 | 9 | 8 | 281
10 | 0.1947 | 0.3854 | 10 | 4 | 290
11 | 0.2083 | 0.3754 | 11 | -3 | 300
12 | 0.2173 | 0.3788 | 12 | -7 | 297
13 | 0.2241 | 0.3787 | 13 | -10 | 297
14 | 0.2533 | 0.3783 | 14 | -20 | 297
15 | 0.2526 | 0.3800 | 15 | -20 | 295
16 | 0.2652 | 0.3721 | 16 | -24 | 304
before | after | size | before (rounded) | after (rounded) | diff (%) | |||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff (%) | | 18 | 21 | 33 | |||||
label | | +18% | +21% | +33% | |||||
1994 - Pink Floyd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraft v8.1 DX12 (higher is better) | fps | diff
gxMT Off | 42.68 |
gxMT On | 58.19 | +36%
Ryzen 7 2700X + GTX1080 (graphics10 liquidgood Boralus Harbor)
DOTA 2 version 3359 gameplay 7.21B (higher is better) | fps | diff
DX9 | 53.8500972007 |
DX11 | 58.9473747244 | +9%
VK Beta | 60.1682313467 | +12%
Ryzen 5 2400G
Testing conducted as of 2/18/2019. System configuration: AMD Ryzen™ 5 2400G Processor, 2x8GB DDR4-3200 (14-14-14-34), Radeon™ RX Vega 11 (driver 19.1.1 Jan20), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. Results may vary. DOTA 2 version 3359 gameplay 7.21B, Steam Beta Built Feb 15 2019 at 15:03:34.
Audacity LAME MP3 Encode x86 (lower is better) | seconds | diff
v3.99 non-SSE2 | 170 |
v3.100 SSE2 | 82 | +107%
Ryzen 7 2700X
THREAD CONCURRENCY EXAMPLE
• Threads shown are hardware threads (a.k.a. logical processors).
• Testing done by Kenneth Mitchell, January 18, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable. Wolfenstein 2, SAS Machine Finale, 1080p, High, novsync.
Threads | Time (s)
0 | 23
1 | 29
2 | 6
3 | 3
4 | 4
5 | 4
6 | 4
7 | 3
8 | 2
9 | 2
10 | 1
11 | 1
12 | 1
13 | 1
14 | 1
15 | 5
16 | 11
[Chart: World of Warcraft v8.1 DX12, Average FPS (higher is better): gxMT Off 42.68, gxMT On 58.19 (+36%)]
[Chart: Audacity LAME MP3 Encode x86, Elapsed Time (s) (lower is better): v3.99 non-SSE2 170, v3.100 SSE2 82 (+107%)]
[Chart: DOTA 2 version 3359 gameplay 7.21B, Average FPS (higher is better): DX9, DX11 (+9%), VK Beta (+12%)]
[Chart: Cache Latency, core clock cycles (less is better): AMD Ryzen 7 1800X L1D 4, L2U 17, L3U 40; AMD Ryzen 7 2700X L1D 4, L2U 12, L3U 35]
[Chart: Use Best Practices With Spinlocks, Elapsed Time (s) by binary (less is better): Before vs After (+1018%)]
[Chart: Avoid Too Many Non-Temporal Streams, Elapsed Time (s) by binary (less is better): Before vs After (+123%)]
[Chart: Avoid False Sharing, Elapsed Time (s) by binary (less is better): Before vs After (+679%)]
[Chart: LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) by Configuration Platform (less is better): Release x86 52, ReleaseSSE2 x86 44 (+18%), Release x64 43 (+21%), ReleaseSSE2 x64 39 (+33%)]
[Chart: call memcpy benchmark, Elapsed Time (s) by size, before vs after (less is better)]
[Chart: call memcpy benchmark, B difference from A by size]
[Chart: D3D12Multithreading, TitleThrottle=10000, cpu time difference from NumContexts 1 by NumContexts, for Draws1025 and Draws10250 (less is better)]
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
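Several of the metrics listed above are ratios derived from raw event counts. The sketch below shows one plausible way to compute them, assuming that IPC means retired instructions per CPU clock and that PTI means events per thousand retired instructions. The struct, its field names, and the helper functions are hypothetical and are not taken from the profiler.

```cpp
#include <cstdint>

// Hypothetical container for raw counts gathered over one sampling interval.
struct RawCounters {
    uint64_t cpuClocks;
    uint64_t retiredInstructions;
    uint64_t retiredBranches;
    uint64_t mispredictedBranches;
};

// Instructions per clock: retired instructions divided by unhalted CPU clocks.
double ipc(const RawCounters& c) {
    return c.cpuClocks == 0
        ? 0.0
        : static_cast<double>(c.retiredInstructions) / static_cast<double>(c.cpuClocks);
}

// Events per thousand retired instructions (assumed meaning of "PTI").
double perThousandInstructions(uint64_t events, uint64_t retiredInstructions) {
    return retiredInstructions == 0
        ? 0.0
        : 1000.0 * static_cast<double>(events) / static_cast<double>(retiredInstructions);
}

// Share of one event within another, as a percentage
// (for example, mispredicted branches out of all retired branches).
double percentOf(uint64_t part, uint64_t whole) {
    return whole == 0
        ? 0.0
        : 100.0 * static_cast<double>(part) / static_cast<double>(whole);
}
```

Normalizing by retired instructions rather than by time keeps the numbers comparable between runs that execute at different clock frequencies.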
PERFORMANCE COUNTER DOMAINS
FP: floating point
LS: load/store
IC/BP: instruction cache and branch prediction
EX (SC): integer ALU & AGU execution and scheduling
L2
DE: instruction decode, dispatch, microcode sequencer & micro-op cache
L3
DF: Data Fabric
UMC: Unified Memory Controller (NDA only)
IOHC: IO Hub Controller (NDA only)
rdpmc, SMN in/out
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
• Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if application exceeds power, current, or thermal limits
POWER MANAGEMENT
OPTIMIZATIONS AND LESSONS LEARNED
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
• Binary compiled after applying platform x64 shows higher performance
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
COMPILE FOR PLATFORM X64
[Chart: LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) by Configuration Platform (less is better)]
Configuration Platform | seconds | diff
Release x86 | 52 |
ReleaseSSE2 x86 | 44 | +18%
Release x64 | 43 | +21%
ReleaseSSE2 x64 | 39 | +33%
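The percentage labels on these charts appear consistent with (baseline time / new time - 1) x 100. That formula is inferred from the published numbers and is not stated in the slides. A minimal sketch under that assumption, with a hypothetical function name:

```cpp
#include <cmath>

// Percent improvement of a "lower is better" elapsed time relative to a baseline,
// assuming the chart labels were computed as (baseline / candidate - 1) * 100.
long percentFaster(double baselineSeconds, double candidateSeconds) {
    return std::lround((baselineSeconds / candidateSeconds - 1.0) * 100.0);
}
```

With the LAME 3.100 timings, 52 s against 44 s, 43 s, and 39 s gives 18, 21, and 33, matching the chart labels. The Audacity timings of 170 s against 82 s give 107.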
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

    0:000> g
    (79c838) Illegal instruction - code c000001d (first chance)
    (79c838) Illegal instruction - code c000001d ( second chance )
    mod!foobar+0xb63
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
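As a rough illustration of the advice above, the sketch below reads the CPUID feature bits for SSSE3 (leaf 1, ECX bit 9), AVX512F (leaf 7 subleaf 0, EBX bit 16), and FMA4 (leaf 0x80000001, ECX bit 16), then picks a pabsd-based path only when SSSE3 is enumerated. All function and struct names here are hypothetical. It assumes an x86 or x64 build with MSVC, GCC, or Clang. AVX-512 and FMA4 additionally need operating system support for the wider register state (checked with XGETBV), which this sketch does not cover.

```cpp
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif
#include <tmmintrin.h>  // SSSE3 intrinsics such as _mm_abs_epi32 (pabsd)

struct CpuFeatures { bool ssse3; bool fma4; bool avx512f; };

// Thin wrapper over the compiler-specific CPUID intrinsic.
static void cpuid(uint32_t leaf, uint32_t subleaf, uint32_t r[4]) {
#if defined(_MSC_VER)
    int regs[4];
    __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
    for (int i = 0; i < 4; ++i) r[i] = static_cast<uint32_t>(regs[i]);
#else
    __cpuid_count(leaf, subleaf, r[0], r[1], r[2], r[3]);
#endif
}

CpuFeatures queryCpuFeatures() {
    CpuFeatures f = {false, false, false};
    uint32_t r[4];
    cpuid(0, 0, r);
    const uint32_t maxStandardLeaf = r[0];
    cpuid(0x80000000u, 0, r);
    const uint32_t maxExtendedLeaf = r[0];
    if (maxStandardLeaf >= 1) { cpuid(1, 0, r); f.ssse3 = ((r[2] >> 9) & 1u) != 0; }
    if (maxStandardLeaf >= 7) { cpuid(7, 0, r); f.avx512f = ((r[1] >> 16) & 1u) != 0; }
    if (maxExtendedLeaf >= 0x80000001u) {
        cpuid(0x80000001u, 0, r);
        f.fma4 = ((r[2] >> 16) & 1u) != 0;
    }
    return f;
}

// Scalar fallback that runs on any x86 processor.
static void abs4Scalar(const int32_t* in, int32_t* out) {
    for (int i = 0; i < 4; ++i) {
        const uint32_t u = static_cast<uint32_t>(in[i]);
        out[i] = static_cast<int32_t>(in[i] < 0 ? 0u - u : u);
    }
}

// SSSE3 path: only call this after CPUID reports SSSE3.
#if defined(__GNUC__)
__attribute__((target("ssse3")))
#endif
static void abs4Ssse3(const int32_t* in, int32_t* out) {
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_abs_epi32(v));
}

// Dispatch once on the detected feature set instead of assuming SSSE3 exists.
void abs4(const int32_t* in, int32_t* out) {
    static const bool hasSsse3 = queryCpuFeatures().ssse3;
    if (hasSsse3) abs4Ssse3(in, out); else abs4Scalar(in, out);
}
```

Checking the feature bit once and caching the result keeps the dispatch cost negligible, and the scalar path means a processor without SSSE3 never reaches the pabsd instruction.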
USE BEST PRACTICES WHEN COUNTING CORES
// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        } else {
            count = cores;
        }
    }
    return count;
}

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
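The listing above relies on helpers such as getCpuidVendor and getCpuidFamily without showing them. The sketch below is one possible implementation and is not the code from the linked sample: the vendor string comes from CPUID leaf 0 (EBX, EDX, ECX order), and the family is the base family from leaf 1 EAX bits 11:8 plus the extended family from bits 27:20 when the base family is 0xF. The names cpuidLeaf, decodeCpuidFamily, and chooseThreadCount are hypothetical. An x86 or x64 build with MSVC, GCC, or Clang is assumed.

```cpp
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <cpuid.h>
#endif

// Thin wrapper over the compiler-specific CPUID intrinsic.
static void cpuidLeaf(uint32_t leaf, uint32_t regs[4]) {
#if defined(_MSC_VER)
    int r[4];
    __cpuid(r, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<uint32_t>(r[i]);
#else
    __cpuid(leaf, regs[0], regs[1], regs[2], regs[3]);
#endif
}

// Writes the 12-character vendor string plus a terminating NUL.
void getCpuidVendor(char vendor[13]) {
    uint32_t r[4];
    cpuidLeaf(0, r);
    std::memcpy(vendor + 0, &r[1], 4);  // EBX
    std::memcpy(vendor + 4, &r[3], 4);  // EDX
    std::memcpy(vendor + 8, &r[2], 4);  // ECX
    vendor[12] = '\0';
}

// Decodes the display family from the EAX value returned by CPUID leaf 1.
uint32_t decodeCpuidFamily(uint32_t eax) {
    const uint32_t baseFamily = (eax >> 8) & 0xFu;
    const uint32_t extendedFamily = (eax >> 20) & 0xFFu;
    return baseFamily == 0xFu ? baseFamily + extendedFamily : baseFamily;
}

uint32_t getCpuidFamily() {
    uint32_t r[4];
    cpuidLeaf(1, r);
    return decodeCpuidFamily(r[0]);
}

// Same policy as the slide: physical cores on AMD processors except family 15h.
uint32_t chooseThreadCount(const char* vendor, uint32_t family,
                           uint32_t cores, uint32_t logical) {
    if (0 == std::strcmp(vendor, "AuthenticAMD") && family != 0x15u) return cores;
    return logical;
}
```

Keeping the decision in a function that takes plain values makes the policy easy to unit test without the target hardware, and easy to override once profiling shows a better thread count for a given title.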
Incorrect enum.exe output for Ryzen™ 7 2700X

    Physical Processor ID 0 has 16 cores as logical processors 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors 16
    Number of active physical processors 1
    Number of cores per processor 16
    Number of threads per processor core 1

--------------------------------------------------------------------------------
Correct output

    Number of cores per processor 8
    Number of threads per processor core 2

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
  • Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
DWORD lps = 64;                      // logical processors
DWORD_PTR p = 0xffffffffffffffff;    // process mask
int b = 32;                          // bit index

#if 0 // BAD
    int x = p;                       // signed narrowing
    int count = 0;
    // while (x != 0)                // never zero
    while (x > 0)                    // false
    {
        count += (x & 1);
        x >>= 1;                     // fill bits with sign
    }
    printf("popcnt = %i\n", count);  // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined 0
#else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0)
    {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
#endif

AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
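A portable way to sidestep both pitfalls is to keep affinity masks in an unsigned 64-bit type from end to end. The sketch below uses std::uint64_t in place of DWORD_PTR so it builds outside Windows. The helper names are hypothetical.

```cpp
#include <cstdint>
#include <vector>

// Counts set bits without ever converting the mask to a signed or narrower type.
int countSetBits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        mask &= mask - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Builds a single-bit mask with a 64-bit unsigned shift; returns 0 when out of range.
std::uint64_t bitMask(unsigned index) {
    return index < 64 ? (std::uint64_t{1} << index) : 0;
}

// Lists the logical processor indices present in an affinity mask.
std::vector<unsigned> logicalProcessorIndices(std::uint64_t mask) {
    std::vector<unsigned> indices;
    for (unsigned i = 0; i < 64; ++i) {
        if (mask & bitMask(i)) indices.push_back(i);
    }
    return indices;
}
```

For a full 64-processor mask, countSetBits returns 64 and bitMask(32) yields 0x0000000100000000, the value the GOOD branch of the listing prints.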
// char buffer[0x1000];              // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (buffer != NULL && GetLogicalProcessorInformationEx(
                RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
            // use the returned processor information here
        }
        free(buffer);
    }
}

GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or insufficient buffers.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• A binary compiled using NumContexts worker threads shows higher performance given a sufficient draw count.
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores - 1, Draws / 250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: D3D12Multithreading, TitleThrottle=10000. X axis: NumContexts (0 to 16). Y axis: CPU time difference from NumContexts 1 (higher is better). Series: Draws 1025, Draws 10250.
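The NumContexts recommendation above can be written as a small helper. This is a sketch only: the function name `recommended_num_contexts` is hypothetical, and the clamp to a minimum of one context is an added assumption so that low draw counts or single-core systems still get a usable value.

```cpp
#include <algorithm>

// NumContexts = min(cores - 1, draws / 250), as recommended on the slide.
// Assumption (not from the slide): never return fewer than 1 context.
inline int recommended_num_contexts(int cores, int draws) {
    const int by_cores = cores - 1;     // leave one core for the main thread
    const int by_draws = draws / 250;   // integer division
    return std::max(1, std::min(by_cores, by_draws));
}
```

For an 8-core processor, 1025 draws gives 4 contexts and 10250 draws gives 7.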
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin lock best practices
  • Avoid lock prefix instructions
  • Use the pause instruction
  • Test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions
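PTI metrics normalize a raw event count by the number of retired instructions. The sketch below shows that arithmetic under the assumption that both raw counts are available; AMD uProf reports PTI values itself, and the function name `per_thousand_instructions` is hypothetical.

```cpp
#include <cstdint>

// Events per thousand retired instructions (PTI).
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    if (retired_instructions == 0) {
        return 0.0;  // no instructions retired: report zero rather than divide by zero
    }
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}
```

Normalizing this way lets functions with very different instruction counts be compared against the same threshold.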
namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD
    void Lock(PLOCK pl) {
        // Every spin iteration issues a locked read-modify-write.
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                   pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {  // lock cmpxchg
        }
    }
#else // GOOD
    void Lock(PLOCK pl) {
        // Test first with a plain read, then test-and-set with xchg.
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
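The same test and test-and-set idea can be sketched with standard C++ atomics instead of MSVC intrinsics. This is an illustrative variant, not the deck's code: the class name `TtasSpinLock` and the `SPIN_PAUSE` macro are hypothetical, the non-x86 fallback to `std::this_thread::yield()` is an assumption, and unlock uses a release store where the deck uses `_InterlockedExchange`.

```cpp
#include <atomic>
#include <thread>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()  // assumed fallback for non-x86 targets
#endif

// Spin lock padded to its own 64-byte cache line.
class alignas(64) TtasSpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiting cores issue no locked writes.
            while (taken_.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: attempt the exchange only once the lock looks free.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    bool try_lock() {
        return !taken_.load(std::memory_order_relaxed) &&
               !taken_.exchange(true, std::memory_order_acquire);
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> taken_{false};
};
```

Because the class provides `lock`, `try_lock`, and `unlock`, it can be used with `std::lock_guard<TtasSpinLock>`.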
PERFORMANCE
• The binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: Use Best Practices With Spinlocks. Elapsed Time (s) by binary (less is better). Before: 304, After: 27 (+1018%).
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <windows.h>
#include <algorithm>  // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512
alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        // Multiply-accumulate LEN pairs of 4x4 matrices.
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp dword ptr [?gLock@@3IA],1
    je 00000001400010D9
    mov eax,1
    xchg eax,dword ptr [?gLock@@3IA]
    cmp eax,1
    jne 00000001400010DD
00000001400010D9:
    pause
    jmp 00000001400010C0
00000001400010DD:
BEFORE (BAD)
00000001400010C1:
    xor eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp eax,ecx
    je 00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs only under very specific circumstances: length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify assembly files (copying recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble object files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\" to the application folder.

memcpy.asm line 456
XmmCopySmall:
    bt __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc memcpy_repmovs
    movups xmm0, [rcx + rdx]     ; load deferred bytes
    add rcx, 16
    sub r8, 16

memset.asm line 93
    ; Check if strings should be used
    cmp r8, 128                  ; is this a small set, size <= 128?
    ja XmmSet                    ; if large set, use XMM set
    bt __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc XmmSetSmall              ; otherwise, use a 16-byte block set
    jmp memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• The binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: call memcpy benchmark. X axis: size (0 to 160). Y axis: Elapsed Time (s) (less is better). Series: before, after.
CODE SAMPLE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        memcpy(b, a, size);  // size is not known at compile time
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // The slide used i <= sizeof(a), which writes one byte past the array.
    for (int i = 0; i < sizeof(a); ++i) {
        a[i] = rand() % 256;
    }
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
    mov r11,rcx
    mov r10,rdx
    cmp r8,10h
    jbe 00000001400012F0
    cmp r8,20h
    jbe 00000001400012D0
    sub rdx,rcx
    jae 00000001400012A4
    lea rax,[r8+r10]
    cmp rcx,rax
    jb 00000001400015D0
    cmp r8,80h
    jbe 0000000140001510
    …
BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
memcpy:
    jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• The binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: Avoid Too Many Non-Temporal Streams. Elapsed Time (s) by binary (less is better). Before: 104, After: 47 (+123%).
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);  // first non-temporal stream
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);  // second non-temporal stream
#else // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);   // regular store instead
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        step(dt);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread-local rather than global or process-shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
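The align-and-pad advice above can be checked at compile time. The following is a minimal sketch, assuming a 64-byte cache line as in the deck's `alignas(64)`; the names `kCacheLine`, `PaddedCounter`, and `slot_stride` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size

// Each element occupies a full cache line, so threads that update
// neighbouring elements never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long sum = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");
static_assert(alignof(PaddedCounter) == kCacheLine, "counter starts on a line boundary");

// Byte distance between adjacent per-thread slots.
inline std::size_t slot_stride(const std::vector<PaddedCounter>& slots) {
    if (slots.size() < 2) {
        return 0;
    }
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&slots[1]) -
        reinterpret_cast<const char*>(&slots[0]));
}
```

The stride between slots is 64 bytes rather than the size of the bare counter. Heap allocations of over-aligned types are only guaranteed to honor the alignment from C++17 onward, so older toolchains may need an aligned allocator.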
PERFORMANCE
• The binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: Avoid False Sharing. Time (s) by binary. Before: 365, After: 47 (+679%).
CODE SAMPLE
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;
#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif
DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++) {
        p->sum += rand() % 2;  // each thread writes only its own element
    }
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                  (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov qword ptr [rsp+28h],rsi
    lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov r9,rbp
    mov dword ptr [rsp+20h],esi
    xor edx,edx
    xor ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov qword ptr [r15+rdi*8],rax
    add rbp,40h
    inc rdi
    cmp rdi,r14
    jb 0000000140001180
BEFORE (BAD)
0000000140001180:
    mov qword ptr [rsp+28h],rsi
    lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov r9,rbp
    mov dword ptr [rsp+20h],esi
    xor edx,edx
    xor ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov qword ptr [r12+rdi*8],rax
    add rbp,4
    inc rdi
    cmp rdi,r14
    jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
Excel Table | |||
Threads | Time (s) | ||
0 | 23 | ||
1 | 29 | ||
2 | 6 | ||
3 | 3 | ||
4 | 4 | ||
5 | 4 | ||
6 | 4 | ||
7 | 3 | ||
8 | 2 | ||
9 | 2 | ||
10 | 1 | ||
11 | 1 | ||
12 | 1 | ||
13 | 1 | ||
14 | 1 | ||
15 | 5 | ||
16 | 11 |
2017 The International | 2015 The Frankfurt Major | |||||||||||||
DX9 | DX11 | Vulkan | DX9 | DX11 | Vulkan | |||||||||
1 | 486 | 540 | 532 | 521 | 589 | 590 | ||||||||
2 | 518 | 543 | 547 | 538 | 588 | 602 | ||||||||
3 | 517 | 545 | 549 | 539 | 588 | 600 | ||||||||
4 | 518 | 544 | 551 | 539 | 589 | 602 | ||||||||
average last 3 | 518 | 544 | 549 | 539 | 589 | 601 | ||||||||
5 | 6 | 9 | 12 |
NumContext | Draws1025 | Draws10250 | NumContext | Draws1025 | Draws10250 | |||||||
1 | 02016 | 15024 | 1 | 0 | 0 | |||||||
2 | 0148 | 07923 | 2 | 36 | 90 | |||||||
3 | 01336 | 05653 | 3 | 51 | 166 | |||||||
4 | 01278 | 04625 | 4 | 58 | 225 | |||||||
5 | 01387 | 04025 | 5 | 45 | 273 | |||||||
6 | 01499 | 03674 | 6 | 34 | 309 | |||||||
7 | 017 | 03429 | 7 | 19 | 338 | |||||||
8 | 01867 | 03432 | 8 | 8 | 338 | |||||||
9 | 01859 | 03947 | 9 | 8 | 281 | |||||||
10 | 01947 | 03854 | 10 | 4 | 290 | |||||||
11 | 02083 | 03754 | 11 | -3 | 300 | |||||||
12 | 02173 | 03788 | 12 | -7 | 297 | |||||||
13 | 02241 | 03787 | 13 | -10 | 297 | |||||||
14 | 02533 | 03783 | 14 | -20 | 297 | |||||||
15 | 02526 | 03800 | 15 | -20 | 295 | |||||||
16 | 02652 | 03721 | 16 | -24 | 304 |
before | after | before | after | diff | ||||||||
29849 | 24421 | 1 | 30 | 24 | 22 | |||||||
29845 | 24428 | 2 | 30 | 24 | 22 | |||||||
29932 | 24421 | 3 | 30 | 24 | 23 | |||||||
29873 | 24421 | 4 | 30 | 24 | 22 | |||||||
29862 | 24422 | 5 | 30 | 24 | 22 | |||||||
29861 | 24429 | 6 | 30 | 24 | 22 | |||||||
29871 | 24429 | 7 | 30 | 24 | 22 | |||||||
29858 | 24423 | 8 | 30 | 24 | 22 | |||||||
29852 | 24420 | 9 | 30 | 24 | 22 | |||||||
29846 | 24426 | 10 | 30 | 24 | 22 | |||||||
29853 | 24423 | 11 | 30 | 24 | 22 | |||||||
29849 | 24428 | 12 | 30 | 24 | 22 | |||||||
29887 | 24433 | 13 | 30 | 24 | 22 | |||||||
29864 | 24430 | 14 | 30 | 24 | 22 | |||||||
29875 | 24431 | 15 | 30 | 24 | 22 | |||||||
29860 | 24425 | 16 | 30 | 24 | 22 | |||||||
21708 | 16280 | 17 | 22 | 16 | 33 | |||||||
21707 | 16286 | 18 | 22 | 16 | 33 | |||||||
21712 | 16283 | 19 | 22 | 16 | 33 | |||||||
21775 | 16294 | 20 | 22 | 16 | 34 | |||||||
21717 | 16281 | 21 | 22 | 16 | 33 | |||||||
21716 | 16293 | 22 | 22 | 16 | 33 | |||||||
21724 | 16287 | 23 | 22 | 16 | 33 | |||||||
21713 | 16282 | 24 | 22 | 16 | 33 | |||||||
21709 | 16280 | 25 | 22 | 16 | 33 | |||||||
21706 | 16285 | 26 | 22 | 16 | 33 | |||||||
21711 | 16283 | 27 | 22 | 16 | 33 | |||||||
21709 | 16281 | 28 | 22 | 16 | 33 | |||||||
21719 | 16288 | 29 | 22 | 16 | 33 | |||||||
21718 | 16291 | 30 | 22 | 16 | 33 | |||||||
21724 | 16287 | 31 | 22 | 16 | 33 | |||||||
21715 | 16282 | 32 | 22 | 16 | 33 | |||||||
168363 | 21715 | 33 | 168 | 22 | 675 | |||||||
179157 | 21720 | 34 | 179 | 22 | 725 | |||||||
180123 | 21737 | 35 | 180 | 22 | 729 | |||||||
182127 | 21747 | 36 | 182 | 22 | 737 | |||||||
183919 | 21729 | 37 | 184 | 22 | 746 | |||||||
188688 | 21733 | 38 | 189 | 22 | 768 | |||||||
187310 | 21724 | 39 | 187 | 22 | 762 | |||||||
190087 | 21719 | 40 | 190 | 22 | 775 | |||||||
198130 | 21714 | 41 | 198 | 22 | 812 | |||||||
207815 | 21721 | 42 | 208 | 22 | 857 | |||||||
206384 | 21716 | 43 | 206 | 22 | 850 | |||||||
200902 | 21718 | 44 | 201 | 22 | 825 | |||||||
203728 | 21729 | 45 | 204 | 22 | 838 | |||||||
210253 | 21740 | 46 | 210 | 22 | 867 | |||||||
207963 | 21715 | 47 | 208 | 22 | 858 | |||||||
211864 | 35284 | 48 | 212 | 35 | 500 | |||||||
214627 | 29857 | 49 | 215 | 30 | 619 | |||||||
225410 | 29864 | 50 | 225 | 30 | 655 | |||||||
240285 | 29859 | 51 | 240 | 30 | 705 | |||||||
222620 | 29869 | 52 | 223 | 30 | 645 | |||||||
226604 | 29866 | 53 | 227 | 30 | 659 | |||||||
228387 | 29879 | 54 | 228 | 30 | 664 | |||||||
231367 | 29868 | 55 | 231 | 30 | 675 | |||||||
236242 | 29862 | 56 | 236 | 30 | 691 | |||||||
236291 | 29856 | 57 | 236 | 30 | 691 | |||||||
247092 | 29863 | 58 | 247 | 30 | 727 | |||||||
245454 | 29859 | 59 | 245 | 30 | 722 | |||||||
252575 | 29884 | 60 | 253 | 30 | 745 | |||||||
247868 | 29879 | 61 | 248 | 30 | 730 | |||||||
249813 | 29871 | 62 | 250 | 30 | 736 | |||||||
255344 | 29870 | 63 | 255 | 30 | 755 | |||||||
236307 | 40720 | 64 | 236 | 41 | 480 | |||||||
247089 | 35286 | 65 | 247 | 35 | 600 | |||||||
255220 | 35292 | 66 | 255 | 35 | 623 | |||||||
257962 | 35298 | 67 | 258 | 35 | 631 | |||||||
260628 | 35299 | 68 | 261 | 35 | 638 | |||||||
257883 | 35301 | 69 | 258 | 35 | 631 | |||||||
260526 | 35313 | 70 | 261 | 35 | 638 | |||||||
263334 | 35298 | 71 | 263 | 35 | 646 | |||||||
266150 | 35295 | 72 | 266 | 35 | 654 | |||||||
268966 | 35284 | 73 | 269 | 35 | 662 | |||||||
268824 | 35292 | 74 | 269 | 35 | 662 | |||||||
282311 | 35286 | 75 | 282 | 35 | 700 | |||||||
276917 | 35314 | 76 | 277 | 35 | 684 | |||||||
279611 | 35312 | 77 | 280 | 35 | 692 | |||||||
282355 | 35308 | 78 | 282 | 35 | 700 | |||||||
285150 | 35297 | 79 | 285 | 35 | 708 | |||||||
239033 | 46147 | 80 | 239 | 46 | 418 | |||||||
249804 | 40712 | 81 | 250 | 41 | 514 | |||||||
257865 | 40721 | 82 | 258 | 41 | 533 | |||||||
260715 | 40732 | 83 | 261 | 41 | 540 | |||||||
263355 | 40758 | 84 | 263 | 41 | 546 | |||||||
260772 | 40734 | 85 | 261 | 41 | 540 | |||||||
263350 | 40748 | 86 | 263 | 41 | 546 | |||||||
266056 | 40728 | 87 | 266 | 41 | 553 | |||||||
268762 | 40721 | 88 | 269 | 41 | 560 | |||||||
271542 | 40714 | 89 | 272 | 41 | 567 | |||||||
271455 | 40741 | 90 | 271 | 41 | 566 | |||||||
285085 | 40716 | 91 | 285 | 41 | 600 | |||||||
279673 | 40733 | 92 | 280 | 41 | 587 | |||||||
282394 | 40717 | 93 | 282 | 41 | 594 | |||||||
284988 | 40711 | 94 | 285 | 41 | 600 | |||||||
287808 | 40721 | 95 | 288 | 41 | 607 | |||||||
241594 | 51571 | 96 | 242 | 52 | 368 | |||||||
252524 | 46201 | 97 | 253 | 46 | 447 | |||||||
260606 | 46162 | 98 | 261 | 46 | 465 | |||||||
263279 | 46164 | 99 | 263 | 46 | 470 | |||||||
266023 | 46168 | 100 | 266 | 46 | 476 | |||||||
263375 | 46151 | 101 | 263 | 46 | 471 | |||||||
266093 | 46145 | 102 | 266 | 46 | 477 | |||||||
268778 | 46154 | 103 | 269 | 46 | 482 | |||||||
271491 | 46147 | 104 | 271 | 46 | 488 | |||||||
274127 | 46196 | 105 | 274 | 46 | 493 | |||||||
274222 | 46180 | 106 | 274 | 46 | 494 | |||||||
287806 | 46189 | 107 | 288 | 46 | 523 | |||||||
282562 | 46142 | 108 | 283 | 46 | 512 | |||||||
285160 | 46149 | 109 | 285 | 46 | 518 | |||||||
288021 | 46155 | 110 | 288 | 46 | 524 | |||||||
290423 | 46174 | 111 | 290 | 46 | 529 | |||||||
244325 | 57054 | 112 | 244 | 57 | 328 | |||||||
255233 | 51607 | 113 | 255 | 52 | 395 | |||||||
263478 | 51623 | 114 | 263 | 52 | 410 | |||||||
265972 | 51602 | 115 | 266 | 52 | 415 | |||||||
268758 | 51588 | 116 | 269 | 52 | 421 | |||||||
265971 | 51604 | 117 | 266 | 52 | 415 | |||||||
268778 | 51584 | 118 | 269 | 52 | 421 | |||||||
271474 | 51630 | 119 | 271 | 52 | 426 | |||||||
274286 | 51606 | 120 | 274 | 52 | 431 | |||||||
276985 | 51584 | 121 | 277 | 52 | 437 | |||||||
276919 | 51578 | 122 | 277 | 52 | 437 | |||||||
290390 | 51582 | 123 | 290 | 52 | 463 | |||||||
285075 | 51574 | 124 | 285 | 52 | 453 | |||||||
287746 | 51584 | 125 | 288 | 52 | 458 | |||||||
290689 | 51626 | 126 | 291 | 52 | 463 | |||||||
293423 | 51624 | 127 | 293 | 52 | 468 | |||||||
247168 | 62460 | 128 | 247 | 62 | 296 | |||||||
54310 | 62443 | 129 | 54 | 62 | -13 | |||||||
54289 | 62429 | 130 | 54 | 62 | -13 | |||||||
54281 | 62444 | 131 | 54 | 62 | -13 | |||||||
54298 | 62432 | 132 | 54 | 62 | -13 | |||||||
54285 | 62497 | 133 | 54 | 62 | -13 | |||||||
54310 | 62479 | 134 | 54 | 62 | -13 | |||||||
54292 | 62494 | 135 | 54 | 62 | -13 | |||||||
54334 | 62469 | 136 | 54 | 62 | -13 | |||||||
54309 | 62444 | 137 | 54 | 62 | -13 | |||||||
54297 | 62429 | 138 | 54 | 62 | -13 | |||||||
54285 | 62451 | 139 | 54 | 62 | -13 | |||||||
54292 | 62432 | 140 | 54 | 62 | -13 | |||||||
54282 | 62473 | 141 | 54 | 62 | -13 | |||||||
54334 | 62478 | 142 | 54 | 62 | -13 | |||||||
54312 | 62479 | 143 | 54 | 62 | -13 | |||||||
52971 | 47520 | 144 | 53 | 48 | 11 | |||||||
47239 | 41804 | 145 | 47 | 42 | 13 | |||||||
47230 | 41796 | 146 | 47 | 42 | 13 | |||||||
47223 | 41805 | 147 | 47 | 42 | 13 | |||||||
47234 | 41801 | 148 | 47 | 42 | 13 | |||||||
47227 | 41824 | 149 | 47 | 42 | 13 | |||||||
47302 | 41824 | 150 | 47 | 42 | 13 | |||||||
47250 | 41823 | 151 | 47 | 42 | 13 | |||||||
47247 | 41817 | 152 | 47 | 42 | 13 | |||||||
47243 | 41797 | 153 | 47 | 42 | 13 | |||||||
47223 | 41808 | 154 | 47 | 42 | 13 | |||||||
47223 | 41829 | 155 | 47 | 42 | 13 | |||||||
47395 | 41837 | 156 | 47 | 42 | 13 | |||||||
47229 | 41820 | 157 | 47 | 42 | 13 | |||||||
47253 | 41805 | 158 | 47 | 42 | 13 | |||||||
47246 | 41797 | 159 | 47 | 42 | 13 | |||||||
51613 | 46151 | 160 | 52 | 46 | 12 |
run | Release x86 | ReleaseSSE2 x86 | Release x64 | ReleaseSSE2 x64 | |||||
Average | 52 | 44 | 43 | 39 | |||||
1 | 52 | 44 | 43 | 39 | |||||
2 | 52 | 44 | 43 | 39 | |||||
3 | 52 | 44 | 43 | 39 | |||||
4 | 52 | 44 | 43 | 39 | |||||
diff | 18 | 21 | 33 | ||||||
label | +18 | +21 | +33 | ||||||
1994 - Pink Floydd - Wish You Were Here (Digital Remaster) (UPC-A 724382975021) |
before | after | binary | seconds | difference | ||||||
3878270 | 468667 | Before | 365 | |||||||
3579369 | 468613 | After | 47 | +679 | ||||||
3617797 | 468606 | |||||||||
3585746 | 468740 | |||||||||
4053311 | 468798 | |||||||||
3620973 | 468666 | |||||||||
3512279 | 468835 | |||||||||
3615476 | 468786 | |||||||||
3507937 | 468740 | |||||||||
3562341 | 468730 |
before | after | binary | seconds | difference | ||||||
1007610284 | 476582356 | Before | 104 | |||||||
1024247928 | 448045378 | After | 47 | +123 | ||||||
1052145299 | 466659369 | |||||||||
1056735958 | 464907149 | |||||||||
1024368829 | 474053478 | |||||||||
1066633156 | 461674963 | |||||||||
1058203029 | 462514829 | |||||||||
1041118097 | 49289895 | |||||||||
1055392649 | 475898095 | |||||||||
1051172582 | 461760799 |
before | after | binary | seconds | difference | ||||||
3608651906 | 267431402 | Before | 304 | |||||||
2242053937 | 273707055 | After | 27 | +1018 | ||||||
3042886884 | 271746015 | |||||||||
2609372553 | 269953529 | |||||||||
3237432947 | 269730537 | |||||||||
2991803304 | 2725437 | |||||||||
3665558054 | 273832348 | |||||||||
3077696925 | 272580411 | |||||||||
2374288763 | 271854875 | |||||||||
3504976333 | 272329958 |
L1D | L2U | L3U | |||||
AMD Ryzen 7 1800X | 4 | 17 | 40 | ||||
AMD Ryzen 7 2700X | 4 | 12 | 35 |
World of Warcraftv81 DX12(higher is better) | fps | diff | |||
gxMT Off | 4268 | ||||
gxMT On | 5819 | +36 | |||
Ryzen 7 2700X + GTX1080(graphics10 liquidgood Boralus Harbor) | |||||
DOTA 2version 3359 gameplay 721B(higher is better) | fps | diff | |||
DX9 | 538500972007 | ||||
DX11 | 589473747244 | +9 | |||
VK Beta | 601682313467 | +12 | |||
Ryzen 5 2400G | |||||
Testing conducted as of 2182019 System configuration AMD Ryzentrade 5 2400G Processor 2x8GB DDR4-3200 (14-14-14-34) Radeontrade RX Vega 11 (driver 1911 Jan20) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolutionResults may varyDOTA2 version 3359 gameplay 721BSteam Beta Built Feb 15 2019 at 150334 | |||||
AudacityLAME MP3 Encode x86(lower is better) | seconds | diff | |||
v399 non-SSE2 | 170 | ||||
v3100 SSE2 | 82 | +107 | |||
Ryzen 7 2700X |
ASSESS PERFORMANCE (EXTENDED) EXAMPLE
Analyze issues using hardware event based sampling
• CPU clocks
• Ret inst
• IPC
• Retired Branch Instructions PTI
• Retired Branch Instructions Mispredicted
• Data Cache Accesses PTI
• Demand Data Cache Miss
• Data Cache Miss
• Data Cache Refills DRAM PTI
• Data Cache Refills CCX PTI
• Data Cache Refills L2 PTI
• Misalign Loads PTI
• ALUTokenStall PTI
• CacheableLocks PTI
• StliOther PTI
• WcbFull PTI
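The "PTI" suffix on these counters means per thousand instructions (the deck spells this out later for ALUTokenStall), and IPC is retired instructions divided by CPU clocks. The following is a minimal sketch of how such ratios could be derived from raw counts. The struct and function names are hypothetical and are not part of AMD uProf.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for raw counts; not a uProf data structure.
struct RawCounts {
    std::uint64_t cpuClocks;            // unhalted CPU clock cycles
    std::uint64_t retiredInstructions;  // retired instruction count
    std::uint64_t eventCount;           // e.g. ALUTokenStall or WcbFull occurrences
};

// Instructions per cycle: retired instructions divided by CPU clocks.
inline double ipc(const RawCounts& c) {
    assert(c.cpuClocks > 0);
    return static_cast<double>(c.retiredInstructions) / static_cast<double>(c.cpuClocks);
}

// Event occurrences per thousand retired instructions (PTI).
inline double perThousandInstructions(const RawCounts& c) {
    assert(c.retiredInstructions > 0);
    return 1000.0 * static_cast<double>(c.eventCount) /
           static_cast<double>(c.retiredInstructions);
}
```

With 4000 retired instructions over 2000 clocks and 12 events, this gives an IPC of 2 and 3 events PTI.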
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
Diagram labels: rdpmc, SMN, in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if application exceeds power, current, thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled for Platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
Chart: LAME 3.100 WAV to MP3 Encode, Elapsed Time (s) by Configuration | Platform (less is better)
Configuration | Platform | seconds | diff
Release | x86 | 52 |
ReleaseSSE2 | x86 | 44 | +18%
Release | x64 | 43 | +21%
ReleaseSSE2 | x64 | 39 | +33%
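The percentage labels on this chart appear consistent with (baseline / new − 1) × 100, rounded to the nearest integer, using the Release x86 time as the baseline. That formula is an inference from the four rounded timings, not something the deck states. A small sketch with a hypothetical helper name:

```cpp
#include <cassert>
#include <cmath>

// Percentage by which the baseline time exceeds the new time, rounded to the
// nearest integer. Assumed formula; inferred from the chart labels only.
inline long speedupPercent(double baselineSeconds, double newSeconds) {
    assert(newSeconds > 0.0);
    return std::lround((baselineSeconds / newSeconds - 1.0) * 100.0);
}
```

Applied to the timings above, speedupPercent(52, 44), speedupPercent(52, 43), and speedupPercent(52, 39) give 18, 21, and 33.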
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW
Debugger output:
0:000> g
(79c838) Illegal instruction - code c000001d (first chance)
(79c838) Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
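The checks recommended on this slide come down to testing documented CPUID feature bits before taking a code path. Below is a minimal sketch that decodes those bits from register values that have already been read. The struct and function names are hypothetical. The bit positions are the architecturally defined ones for SSE2, SSSE3, AVX512F, and FMA4.

```cpp
#include <cstdint>

// Register values returned by the CPUID instruction for one leaf.
// Hypothetical helper type; fill it from whatever CPUID wrapper you use.
struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

inline bool testBit(std::uint32_t reg, unsigned n) { return ((reg >> n) & 1u) != 0; }

// leaf1: CPUID function 0000_0001h
inline bool hasSse2(const CpuidRegs& leaf1)  { return testBit(leaf1.edx, 26); }
inline bool hasSsse3(const CpuidRegs& leaf1) { return testBit(leaf1.ecx, 9); }

// leaf7: CPUID function 0000_0007h, subleaf 0
inline bool hasAvx512f(const CpuidRegs& leaf7) { return testBit(leaf7.ebx, 16); }

// ext1: CPUID function 8000_0001h
inline bool hasFma4(const CpuidRegs& ext1) { return testBit(ext1.ecx, 16); }
```

On MSVC the register values can be read with `__cpuid` or `__cpuidex` from `<intrin.h>`. For AVX-class features the operating system must also have enabled the extended register state, which is what the `_xgetbv` check mentioned above covers. That check is not shown here.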
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows
Code (the helpers getProcessorCount, getCpuidVendor, and getCpuidFamily are not shown on the slide):
    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • “Number of cores per processor” & “Number of threads per processor” return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts
Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx
Code:
    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index
    #if 0 // BAD
    int x = p;                          // signed narrowing
    int count = 0;
    // while (x != 0) {                 // never zero
    while (x > 0) {                     // false
        count += (x & 1);
        x >>= 1;                        // fill bits with sign
    }
    printf("popcnt = %i\n", count);                       // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));      // undefined 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));   // 0x0000000100000000
    #endif
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"
Code:
    // char buffer[0x1000];   // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
        }
    }
    free(buffer);
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: cpu time difference from NumContexts 1 (-50 to 400). Series: Draws 1025, Draws 10250.]
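The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small helper. The function name is hypothetical. Integer division for Draws/250 and the lower bound of one context are assumptions added so the result is always usable. The slide does not specify either.

```cpp
#include <algorithm>

// Recommended number of command-list recording contexts, following
// NumContexts = min(cores - 1, draws / 250) from the slide.
// Assumptions: integer division, and never fewer than one context.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    const unsigned byCores = physicalCores > 1 ? physicalCores - 1 : 1;
    const unsigned byDraws = draws / 250;
    return std::max(1u, std::min(byCores, byDraws));
}
```

For an 8-core processor this yields 4 contexts at 1025 draws and 7 contexts at 10250 draws, the two draw counts plotted in the chart.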
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions
Code:
    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
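The same test and test-and-set pattern can be written with standard C++ atomics instead of MSVC intrinsics. This is a sketch under stated assumptions, not the deck's code. The type and function names are hypothetical. Unlock uses a release store rather than the exchange shown above. On non-x86 targets the pause is replaced by a thread yield.

```cpp
#include <atomic>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
inline void cpuRelax() { _mm_pause(); }                  // x86 pause instruction
#else
inline void cpuRelax() { std::this_thread::yield(); }    // fallback (assumption)
#endif

// Hypothetical lock type, aligned to its own 64-byte cache line.
struct alignas(64) TtasSpinLock {
    std::atomic<unsigned> state{0};   // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiters do not issue locked operations.
            while (state.load(std::memory_order_relaxed) != 0) cpuRelax();
            // Test-and-set: attempt the exchange only when the lock looked free.
            if (state.exchange(1, std::memory_order_acquire) == 0) return;
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

// Hypothetical driver: each thread increments a shared counter under the lock.
inline unsigned long long countUnderLock(unsigned numThreads, unsigned iterations) {
    TtasSpinLock lock;
    unsigned long long counter = 0;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&lock, &counter, iterations] {
            for (unsigned i = 0; i < iterations; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling countUnderLock(4, 10000) should return 40000 if the lock provides mutual exclusion.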
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) by binary: Before 304, After 27 (+1018%).]
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:
BEFORE (BAD)
    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder
memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16
memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s) (0 to 35). Series: before, after.]
CODE SAMPLE
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        // size is only known at run time, so the call goes to the runtime memcpy
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide used i <= sizeof(a), which writes one byte past the array
        for (int i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%).]
CODE SAMPLE
    #include <intrin.h>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atof, atoll, rand, srand
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // layout per element: x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);      // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);      // second non-temporal stream
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);       // regular store
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
        movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
        movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
        movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
        movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
        movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
        movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: Avoid False Sharing. Time (s) by binary: Before 365, After 47 (+679%).]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms) %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
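The padding idea in this sample can also be sketched with portable standard C++ threads. The names below are hypothetical and the thread cap is an arbitrary choice to keep the sketch allocation-free. The static_assert documents the assumption that each counter occupies exactly one 64-byte line, which matches the alignas(64) used in the sample.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;   // line size assumed by the alignas(64) examples
constexpr unsigned kMaxThreads = 8;      // hypothetical cap for this sketch

// One counter per cache line, so threads never write to a shared line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should fill exactly one cache line");

// Each thread increments only its own padded counter; totals are combined after join.
inline unsigned long sumWithPaddedCounters(unsigned numThreads, unsigned long iterations) {
    if (numThreads > kMaxThreads) numThreads = kMaxThreads;
    PaddedCounter counters[kMaxThreads];
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (unsigned long i = 0; i < iterations; ++i)
                counters[t].sum += 1;
        });
    }
    for (auto& w : workers) w.join();
    unsigned long total = 0;
    for (unsigned t = 0; t < numThreads; ++t) total += counters[t].sum;
    return total;
}
```

Removing the alignas from PaddedCounter would pack the counters into adjacent bytes, which is the layout the "4 bytes" branch of the sample uses to provoke false sharing.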
PROFILING BEFORE
Many refills from CCX
[Profiler columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profiler columns: same as above]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
BEFORE (BAD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
82
GIVEAWAY
THANK YOU
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
KENNETH MITCHELLISV GAME ENGINEERING RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
85
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201986
The information presented in this document is for informational purposes only and may contain technical inaccuracies omissions and typographical errors
The information contained herein is subject to change and may be rendered inaccurate for many reasons including but not limited to product and roadmap changes component and motherboard version changes new model andor product releases product differences between differing manufacturers software changes BIOS flashes firmware upgrades or the like AMD assumes no obligation to update or otherwise correct or revise this information However AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT INDIRECT SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES
ATTRIBUTIONcopy 2019 Advanced Micro Devices Inc All rights reserved AMD Ryzentrade and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices Inc in the United States andor other jurisdictions Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc Microsoft and Windows are registered trademarks of Microsoft Corporation PCIe is a registered trademark of PCI-SIG Other names are for informational purposes only and may be trademarks of their respective owners
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzen™ Processor Software Optimization | March 20th, 2019
PERFORMANCE COUNTER DOMAINS
• FP: floating point
• LS: load/store
• IC/BP: instruction cache and branch prediction
• EX (SC): integer ALU & AGU execution and scheduling
• L2
• DE: instruction decode, dispatch, microcode sequencer & micro-op cache
• L3
• DF: Data Fabric
• UMC: Unified Memory Controller (NDA only)
• IOHC: IO Hub Controller (NDA only)
• Access: rdpmc; SMN in/out
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
• Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly-optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled after applying Platform x64 shows higher performance
• Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions
• See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
• WAV file: 448MB, 44.4 minutes
Chart: LAME 3.100 WAV to MP3 Encode, elapsed time (s) by Configuration and Platform (less is better)
Release x86: 52 seconds
ReleaseSSE2 x86: 44 seconds (+18%)
Release x64: 43 seconds (+21%)
ReleaseSSE2 x64: 39 seconds (+33%)
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
• Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

    0:000> g
    (79c838) Illegal instruction - code c000001d (first chance)
    (79c838) Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
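The check this slide asks for can be sketched as below. This is a minimal illustration, not code from the talk: the helper names `queryCpuid`, `isBitSet`, `cpuHasSsse3` and `cpuHasFma4` are hypothetical, and the sketch assumes a GCC/Clang or MSVC toolchain. It reads ECX bit 9 of CPUID function 1 for SSSE3 and ECX bit 16 of CPUID function 0x80000001 for FMA4, and reports "not supported" on non-x86 builds. AVX-family code would additionally need the OS-support check via XGETBV, which is left out here.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define SKETCH_HAVE_CPUID 1
#elif (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
#include <cpuid.h>
#define SKETCH_HAVE_CPUID 1
#else
#define SKETCH_HAVE_CPUID 0
#endif

// Hypothetical helper type: the four registers returned by one CPUID leaf.
struct CpuidRegs { std::uint32_t eax, ebx, ecx, edx; };

// Returns false (and zeroed registers) when the leaf is unavailable
// or the build target is not x86.
inline bool queryCpuid(std::uint32_t leaf, CpuidRegs& out) {
    out = CpuidRegs{0u, 0u, 0u, 0u};
#if !SKETCH_HAVE_CPUID
    (void)leaf;
    return false;
#elif defined(_MSC_VER)
    int r[4] = {0, 0, 0, 0};
    __cpuid(r, static_cast<int>(leaf & 0x80000000u));  // highest leaf in this range
    if (static_cast<std::uint32_t>(r[0]) < leaf) return false;
    __cpuid(r, static_cast<int>(leaf));
    out = CpuidRegs{static_cast<std::uint32_t>(r[0]), static_cast<std::uint32_t>(r[1]),
                    static_cast<std::uint32_t>(r[2]), static_cast<std::uint32_t>(r[3])};
    return true;
#else
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (!__get_cpuid(leaf, &a, &b, &c, &d)) return false;  // checks the max leaf itself
    out = CpuidRegs{a, b, c, d};
    return true;
#endif
}

inline bool isBitSet(std::uint32_t value, unsigned bit) {
    assert(bit < 32);
    return ((value >> bit) & 1u) != 0;
}

// SSSE3: CPUID function 1, ECX bit 9.
inline bool cpuHasSsse3() {
    CpuidRegs r;
    return queryCpuid(1u, r) && isBitSet(r.ecx, 9);
}

// FMA4: CPUID function 0x80000001, ECX bit 16.
inline bool cpuHasFma4() {
    CpuidRegs r;
    return queryCpuid(0x80000001u, r) && isBitSet(r.ecx, 16);
}
```

A caller would typically evaluate `cpuHasSsse3()` once at startup and select between an SSSE3 code path (for example one using pabsd) and an SSE2 fallback, rather than executing the instruction unconditionally.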
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD “Bulldozer” is not a SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
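The family test in the slide's code depends on how the CPUID family is decoded. The following is a hedged sketch of that decoding and of the same decision rule as a pure function. The names `cpuidDisplayFamily` and `defaultThreadCount` are illustrative and are not the GPUOpen sample's API. It assumes the usual rule that the extended family field of CPUID function 1 EAX is added only when the base family field is 0xF.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Display family from CPUID function 1 EAX:
// bits 11:8 hold the base family, bits 27:20 the extended family.
inline unsigned cpuidDisplayFamily(std::uint32_t eax) {
    const unsigned base = (eax >> 8) & 0xFu;
    const unsigned ext = (eax >> 20) & 0xFFu;
    return (base == 0xFu) ? base + ext : base;
}

// Same rule as the slide: on AMD processors other than family 15h,
// default to the physical core count; otherwise use logical processors.
inline unsigned defaultThreadCount(const char* vendor, unsigned family,
                                   unsigned cores, unsigned logical) {
    assert(vendor != nullptr);
    assert(cores >= 1 && logical >= cores);
    if (std::strcmp(vendor, "AuthenticAMD") == 0 && family != 0x15u) {
        return cores;
    }
    return logical;
}
```

For example, an EAX value with base family 0xF and extended family 0x08 decodes to family 17h, so `defaultThreadCount("AuthenticAMD", 0x17, 8, 16)` yields 8 threads. As the slide says, profiling should still decide the final thread count.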
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index
    #if 0 // BAD
        int x = p;                      // signed narrowing
        int count = 0;
        // while (x != 0) {             // never zero
        while (x > 0) {                 // false
            count += (x & 1);
            x >>= 1;                    // fill bits with sign
        }
        printf("popcnt = %i\n", count); // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
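The GOOD branch above can also be written as two small helpers that keep the mask in an unsigned 64-bit type from end to end. This is a portable sketch with hypothetical names (`countSetBits`, `singleBitMask`). It uses `std::uint64_t` in place of `DWORD_PTR`, which assumes a 64-bit process.

```cpp
#include <cassert>
#include <cstdint>

// Counts set bits without ever converting the mask to a signed or
// narrower type, so the loop terminates and the count is exact.
inline unsigned countSetBits(std::uint64_t mask) {
    unsigned count = 0;
    while (mask != 0) {
        count += static_cast<unsigned>(mask & 1u);
        mask >>= 1;  // unsigned shift: vacated bits are filled with zero
    }
    return count;
}

// Builds the mask for one logical processor index; the shift is done
// on a 64-bit unsigned operand, so indices 32..63 are well defined.
inline std::uint64_t singleBitMask(unsigned bit) {
    assert(bit < 64);
    return std::uint64_t{1} << bit;
}
```

With a full 64-processor mask, `countSetBits(0xffffffffffffffffULL)` gives 64 and `singleBitMask(32)` gives 0x0000000100000000, matching the value in the GOOD branch's comment.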
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"

    // char buffer[0x1000];  /* bad assumption */
    char* buffer = NULL;
    DWORD len = 0;
    // First call with a NULL buffer fails and reports the required length.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            // Second call fills the correctly sized buffer.
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: D3D12Multithreading, TitleThrottle=10000. CPU time % difference from NumContexts 1 (higher is better) versus NumContexts (0 to 16), for two series: Draws 1025 and Draws 10250.
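The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small function. The name `recommendedNumContexts` is hypothetical. The integer division and the lower clamp to one context are assumptions of this sketch, since the slide does not say how to round or what to do when the formula yields zero.

```cpp
#include <algorithm>
#include <cassert>

// NumContexts = min(cores - 1, draws / 250), clamped to at least 1.
// The clamp and the integer division are assumptions of this sketch.
inline unsigned recommendedNumContexts(unsigned cores, unsigned draws) {
    assert(cores >= 1);
    const unsigned byCores = (cores > 1) ? cores - 1 : 1;
    const unsigned byDraws = draws / 250;
    return std::max(1u, std::min(byCores, byDraws));
}
```

With 8 physical cores, the two draw counts from the chart give `recommendedNumContexts(8, 1025) == 4` and `recommendedNumContexts(8, 10250) == 7`. The smaller scene is limited by draw count and the larger one by core count.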
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions
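The PTI figures quoted in this deck are event counts normalized per thousand instructions. A minimal sketch of that normalization follows, assuming PTI means events per 1000 retired instructions. The function names are illustrative and are not part of AMD uProf.

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions (PTI).
inline double perThousandInstructions(std::uint64_t eventCount,
                                      std::uint64_t retiredInstructions) {
    assert(retiredInstructions > 0);
    return 1000.0 * static_cast<double>(eventCount) /
           static_cast<double>(retiredInstructions);
}

// Threshold from the summary slide: >= 3K ALUTokenStall PTI is bad
// for a top function.
inline bool exceedsSpinStallThreshold(double aluTokenStallPti) {
    return aluTokenStallPti >= 3000.0;
}
```

For instance, 4,000,000 stall events over 1,000,000 retired instructions is 4000 PTI, which is over the threshold, while a function with 0 PTI is not.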
    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            // Every iteration issues a locked read-modify-write.
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test with a plain read first, then test-and-set; pause while spinning.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
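For code that cannot use the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant and not the talk's code. The `SpinLock` class, the `SKETCH_CPU_RELAX` macro and `spinLockSmokeCheck` are hypothetical names. Unlike the slide's Unlock, which uses `_InterlockedExchange`, this sketch releases the lock with a release store.

```cpp
#include <atomic>
#include <cassert>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SKETCH_CPU_RELAX() _mm_pause()
#else
#include <thread>
#define SKETCH_CPU_RELAX() std::this_thread::yield()
#endif

// Test and test-and-set spin lock, aligned so the flag has a
// 64-byte line to itself.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiting threads do not
            // issue locked operations.
            while (taken_.load(std::memory_order_relaxed)) {
                SKETCH_CPU_RELAX();
            }
            // Test-and-set: one atomic exchange once the lock looks free.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    bool try_lock() {
        return !taken_.load(std::memory_order_relaxed) &&
               !taken_.exchange(true, std::memory_order_acquire);
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> taken_{false};
};

static_assert(alignof(SpinLock) == 64, "lock should be cache-line aligned");

// Single-threaded check of the lock's state transitions.
inline void spinLockSmokeCheck() {
    SpinLock lock;
    lock.lock();
    assert(!lock.try_lock());  // already taken
    lock.unlock();
    assert(lock.try_lock());   // free again
    lock.unlock();
}
```

Because the class provides `lock`, `try_lock` and `unlock`, it can be used with `std::lock_guard<SpinLock>` in the same places a `std::mutex` would be.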
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Use Best Practices With Spinlocks, elapsed time (s) by binary (less is better)
Before: 304 s
After: 27 s (+1018%)
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "stdlib.h"
    #include "windows.h"
    #include <algorithm>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    // MyLock and gLock are defined in the spin lock code above.
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // 4x4 matrix multiply-accumulate for each of the LEN matrices
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:
BEFORE (BAD)
    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently, “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: call memcpy benchmark, elapsed time (s, axis 0 to 35) versus size (0 to 160), series "before" and "after" (less is better)
CODE SAMPLE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Times 'steps' calls to memcpy with a size that is unknown at compile time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // '<' rather than '<=' so the loop does not write one byte past the end of a
        for (int i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
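The "one stream per hardware thread" advice can be sketched as a loop that updates two arrays but writes only one of them with a non-temporal store. This is a simplified illustration and not the talk's benchmark. The function name `integrateTwoArrays` and its parameters are hypothetical. It assumes an x86 target with SSE, 16-byte aligned pointers, and a length that is a multiple of 4.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_store_ps, _mm_sfence

// streamed[i] += dt * streamedVel[i]   written with a non-temporal store
// cached[i]   += dt * cachedVel[i]     written with a regular store
// Only one array uses write combining, so the loop keeps a single
// non-temporal stream per hardware thread.
inline void integrateTwoArrays(float* streamed, const float* streamedVel,
                               float* cached, const float* cachedVel,
                               std::size_t n, float dt) {
    assert(n % 4 == 0);
    assert(reinterpret_cast<std::uintptr_t>(streamed) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(streamedVel) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(cached) % 16 == 0);
    assert(reinterpret_cast<std::uintptr_t>(cachedVel) % 16 == 0);
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 p = _mm_add_ps(_mm_load_ps(streamed + i),
                                    _mm_mul_ps(_mm_load_ps(streamedVel + i), vdt));
        _mm_stream_ps(streamed + i, p);  // the only write-combining stream
        const __m128 q = _mm_add_ps(_mm_load_ps(cached + i),
                                    _mm_mul_ps(_mm_load_ps(cachedVel + i), vdt));
        _mm_store_ps(cached + i, q);     // goes through the cache
    }
    _mm_sfence();  // make the non-temporal stores globally visible
}
```

Which array to stream is a judgment call: the one that will not be read again soon is the natural candidate, since non-temporal stores bypass the caches.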
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Avoid Too Many Non-Temporal Streams, elapsed time (s) by binary (less is better)
Before: 104 s
After: 47 s (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
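The "align and pad" advice can be checked at compile time and at run time with a few lines. The sketch below is illustrative: `PaddedCounter`, `onSameCacheLine` and `countersAreIsolated` are hypothetical names, and the 64-byte line size is taken from the alignas(64) used elsewhere in this deck.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLineBytes = 64;  // assumption: 64-byte cache lines

// Per-thread data padded and aligned so each instance owns a full line.
struct alignas(kCacheLineBytes) PaddedCounter {
    unsigned long sum = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLineBytes, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLineBytes, "line-aligned counter");

// True when both addresses fall inside the same 64-byte line.
inline bool onSameCacheLine(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLineBytes ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLineBytes;
}

// True when no two neighbouring counters in the array share a line.
inline bool countersAreIsolated(const PaddedCounter* counters, std::size_t n) {
    assert(counters != nullptr);
    for (std::size_t i = 1; i < n; ++i) {
        if (onSameCacheLine(&counters[i - 1].sum, &counters[i].sum)) {
            return false;
        }
    }
    return true;
}
```

With an unpadded 4-byte struct, neighbouring array elements would usually land on the same line, which is the pattern the "Data Cache Refills CCX PTI" counter exposes.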
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Chart: Avoid False Sharing, time (s) by binary
Before: 365 s
After: 47 s (+679%)
CODE SAMPLE
    #include <windows.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes: neighbouring array elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes: each element has its own cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++) {
            p->sum += rand() % 2;
        }
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
Columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX
Columns: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
BEFORE (BAD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP
79
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201980
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
bull New Zen2 core microarchitecturebull New 7nm process technologybull Higher instructions-per-cycle (IPC)bull Doubled floating point width to 256-bitbull Doubled loadstore bandwidthbull PCIereg Gen4 Readybull AM4 Desktop Infrastructurebull See httpswwwamdcomeneventscesbull See httpswwwamdcomeneventsnext-horizon
3RD GENERATION AMD RYZENtrade DESKTOP PROCESSOR
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
POWER MANAGEMENT
• Disabling power management features may reduce variation during A/B testing.
• BIOS Settings
  • “Zen” Common Options
    • Core Performance Boost = Disable
    • Global C-state Control = Disable
• OS Power Options: Choose Power Plan
  • High Performance = selected
• AMD Ryzen™ Master Utility
  • Control Mode = Manual
  • All Cores = Enabled
  • Set a reasonable frequency & voltage, such as P0 custom default
  • Set core clock > base clock to disable boost on “Raven Ridge” processors
  • Note: SMU may still reduce frequency if the application exceeds power, current, or thermal limits
OPTIMIZATIONS AND LESSONS LEARNED
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops. Support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled for the x64 platform shows higher performance.
• The ReleaseSSE2 configuration uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes

LAME 3.100 WAV to MP3 Encode (less is better)
Configuration, Platform | Elapsed Time (s) | Change vs. Release x86
Release, x86 | 52 |
ReleaseSSE2, x86 | 44 | +18%
Release, x64 | 43 | +21%
ReleaseSSE2, x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

Debugger output:
    0:000> g
    (79c838) Illegal instruction - code c000001d (first chance)
    (79c838) Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
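The check-before-use rule can be sketched as a small runtime dispatch. This is a minimal sketch, not code from the talk: it assumes GCC or Clang on x86 (using the `__builtin_cpu_supports` builtin) with an MSVC fallback through `__cpuid`, and the names `cpu_has_ssse3`, `abs_i32`, `abs_i32_ssse3` and `abs_i32_scalar` are hypothetical. The SSSE3 path (which compiles to `pabsd`) only runs when CPUID function 1 reports SSSE3 in ECX bit 9; otherwise a scalar fallback is used.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
  #include <intrin.h>
  #include <tmmintrin.h>
  #define HAVE_X86 1
  #define TARGET_SSSE3
#elif defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
  #include <immintrin.h>
  #define HAVE_X86 1
  #define TARGET_SSSE3 __attribute__((target("ssse3")))
#else
  #define HAVE_X86 0
#endif

// Hypothetical helper: true when CPUID enumerates SSSE3.
inline bool cpu_has_ssse3() {
#if HAVE_X86 && defined(_MSC_VER)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 1);                       // function 1: feature flags
    return (regs[2] & (1 << 9)) != 0;       // ECX bit 9 = SSSE3
#elif HAVE_X86
    return __builtin_cpu_supports("ssse3") != 0;
#else
    return false;                           // not x86: never take the SSSE3 path
#endif
}

// Absolute value without signed overflow on INT32_MIN.
inline std::int32_t abs_one(std::int32_t v) {
    return v < 0 ? static_cast<std::int32_t>(0u - static_cast<std::uint32_t>(v)) : v;
}

// Scalar fallback that runs on any processor.
inline void abs_i32_scalar(const std::int32_t* in, std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) out[i] = abs_one(in[i]);
}

#if HAVE_X86
// SSSE3 path: _mm_abs_epi32 is the pabsd instruction.
TARGET_SSSE3 static void abs_i32_ssse3(const std::int32_t* in, std::int32_t* out,
                                       std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(in + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(out + i), _mm_abs_epi32(v));
    }
    for (; i < n; ++i) out[i] = abs_one(in[i]);   // remaining elements
}
#endif

// Dispatcher: test CPUID once, then choose the implementation.
inline void abs_i32(const std::int32_t* in, std::int32_t* out, std::size_t n) {
#if HAVE_X86
    static const bool has_ssse3 = cpu_has_ssse3();
    if (has_ssse3) { abs_i32_ssse3(in, out, n); return; }
#endif
    abs_i32_scalar(in, out, n);
}
```

Callers only use `abs_i32`; on a processor without SSSE3 (for example “Greyhound” or “Llano”) the scalar path is taken instead of raising an illegal instruction exception.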
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not a SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            }
            else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • Number of cores per processor & Number of threads per processor return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise, the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index
    #if 0 // BAD
        int x = p;                      // signed narrowing
        int count = 0;
        // while (x != 0)               // never zero
        while (x > 0) {                 // false
            count += (x & 1);
            x >>= 1;                    // fill bits with sign
        }
        printf("popcnt = %i\n", count); // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
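The same rule can be sketched portably. This is a minimal sketch under one stated assumption: `std::uint64_t` stands in for `DWORD_PTR` on 64-bit Windows so the block compiles anywhere; the function names `count_processors` and `processor_bit` are hypothetical. Keeping the mask in a 64-bit unsigned type avoids both the sign-filling right shift and the undefined 32-bit left shift described in the bullets.

```cpp
#include <cassert>
#include <cstdint>

// Counts the set bits of a 64-bit affinity mask.
// std::uint64_t is an assumed stand-in for DWORD_PTR on 64-bit Windows.
inline int count_processors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;   // unsigned shift: vacated bits are filled with zero
    }
    return count;
}

// Builds the single-bit mask for logical processor `index` (0..63).
inline std::uint64_t processor_bit(unsigned index) {
    assert(index < 64);
    return std::uint64_t{1} << index;   // 64-bit shift, never `1 << index`
}
```

With all 64 bits set, `count_processors` terminates and returns 64, and `processor_bit(32)` yields `0x0000000100000000` as in the GOOD branch of the slide.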
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise, the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000]; /* bad assumption */
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
  • Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1, Feb 4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: CPU time difference from NumContexts 1. Series: Draws=1025, Draws=10250.]
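The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a minimal sketch with stated assumptions: the function name `recommended_num_contexts` is hypothetical, the division is taken to be integer division, and the result is clamped to at least 1 so that a scene with few draws (or a single-core machine) still gets one context. The clamp is an addition of this sketch, not part of the slide.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper for the slide's rule of thumb:
//   NumContexts = min(cores - 1, Draws / 250)
// Assumptions: integer division, and a lower bound of 1 context.
inline int recommended_num_contexts(int physical_cores, int draws) {
    assert(physical_cores >= 1 && draws >= 0);
    const int by_cores = physical_cores - 1;   // leave one core for the main thread
    const int by_draws = draws / 250;          // roughly 250 draws per context
    return std::max(1, std::min(by_cores, by_draws));
}
```

For the 8-core Ryzen™ 7 2700X used in the test, this gives 4 contexts for Draws=1025 (draw-limited) and 7 contexts for Draws=10250 (core-limited).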
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Use Best Practices With Spinlocks (less is better)
Binary | Elapsed Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 =
            high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL,
                0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads,
            threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 =
            high_resolution_clock::now();
        duration<double> time_span =
            duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n",
            1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
[Slides: SOURCE BEFORE and SOURCE AFTER, screenshots]
DISASSEMBLY SNIPPET

AFTER (GOOD)
00000001400010C0:
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
00000001400010D9:
    pause
    jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix (_InterlockedExchange, _InterlockedExchangePointer).
Instructions generated may vary depending on compiler and optimization flags.
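The spin lock best practices (test and test-and-set, pause while waiting, 64-byte alignment) can also be sketched with standard C++. This is a minimal sketch that swaps in `std::atomic` for the `_Interlocked` intrinsics used in the slides, so the generated instructions may differ from the disassembly shown; the names `SpinLock`, `SPIN_PAUSE` and `run_counter_demo` are hypothetical. On non-x86 targets the sketch falls back to `std::this_thread::yield()` instead of `pause`.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #include <immintrin.h>
  #define SPIN_PAUSE() _mm_pause()
#else
  #define SPIN_PAUSE() std::this_thread::yield()
#endif

// The lock occupies its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<bool> taken{false};

    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is held.
            while (taken.load(std::memory_order_relaxed)) SPIN_PAUSE();
            // Test-and-set: one atomic exchange once the lock looks free.
            if (!taken.exchange(true, std::memory_order_acquire)) return;
        }
    }

    void unlock() { taken.store(false, std::memory_order_release); }
};

// Hypothetical demo: several threads increment one counter under the lock.
inline long run_counter_demo(int num_threads, int increments) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (int i = 0; i < increments; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Waiting threads read the lock word without writing to it, so the cache line is not bounced between cores until the lock is actually released.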
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder

    ; memcpy.asm line 456
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG  ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]        ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

    ; memset.asm line 93
        ; Check if strings should be used
        cmp     r8, 128                  ; is this a small set, size <= 128?
        ja      XmmSet                   ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG  ; check if string set should be used
        jnc     XmmSetSmall              ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s). Series: before, after.]
CODE SAMPLE
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 =
            high_resolution_clock::now();
        // size is only known at run time
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);
        }
        high_resolution_clock::time_point t1 =
            high_resolution_clock::now();
        duration<double> time_span =
            duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n",
            1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // i < sizeof(a): stay inside the array
        for (int i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
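The PTI suffix on the profiler metrics means events per thousand instructions. A minimal sketch of that arithmetic, assuming the denominator is the retired instruction count of the same function or sample window; the function name `per_thousand_instructions` is hypothetical and the thresholds are deliberately left to the caller rather than hard-coded.

```cpp
#include <cassert>
#include <cstdint>

// Normalizes a raw event count to "Per Thousand Instructions" (PTI).
// Assumption: `retired_instructions` covers the same code and time span
// as `events`.
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    assert(retired_instructions > 0);
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}
```

For example, 23,000 WcbFull events over 1,000,000 retired instructions is 23 WcbFull PTI. Normalizing this way lets functions with very different instruction counts be compared against the same threshold.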
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Avoid Too Many Non-Temporal Streams (less is better)
Binary | Elapsed Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 =
            high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 =
            high_resolution_clock::now();
        duration<double> time_span =
            duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n",
            1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
[Slides: SOURCE BEFORE and SOURCE AFTER, screenshots]
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0    ; regular store

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0    ; second non-temporal stream
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad Thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
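The "align and pad thread parameters" advice can be sketched with standard C++ threads. This is a minimal sketch, not the talk's Windows sample: it assumes a 64-byte cache line (matching the `alignas(64)` used in the slides), uses `std::thread` in place of `CreateThread`, and the names `PackedCounter`, `PaddedCounter`, `share_cache_line` and `sum_with_padded_counters` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>

constexpr std::size_t kCacheLine = 64;   // assumed cache line size
constexpr int kThreads = 4;              // fixed thread count for the sketch

// Neighbouring elements of an array of these land in the same cache line.
struct PackedCounter { unsigned long sum; };

// alignas pads each element to a full line, so each thread owns its line.
struct alignas(kCacheLine) PaddedCounter { unsigned long sum; };

// True when two objects start in the same 64-byte line.
template <typename Counter>
bool share_cache_line(const Counter& a, const Counter& b) {
    auto line = [](const Counter& c) {
        return reinterpret_cast<std::uintptr_t>(&c) / kCacheLine;
    };
    return line(a) == line(b);
}

// Each thread updates only its own padded counter; totals are combined
// after all threads have joined.
inline unsigned long sum_with_padded_counters(unsigned long iterations) {
    PaddedCounter counters[kThreads] = {};
    std::thread workers[kThreads];
    for (int t = 0; t < kThreads; ++t) {
        workers[t] = std::thread([&counters, t, iterations] {
            for (unsigned long i = 0; i < iterations; ++i) counters[t].sum += 1;
        });
    }
    for (auto& w : workers) w.join();
    unsigned long total = 0;
    for (const auto& c : counters) total += c.sum;
    return total;
}
```

The layout difference is the whole fix: `sizeof(PaddedCounter)` is 64, so consecutive array elements never share a line, whereas consecutive `PackedCounter` elements do.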
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Avoid False Sharing
Binary | Time (s)
Before | 365
After | 47 (+679%)
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 =
            high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms) %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
Columns shown: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
OPTIMIZATIONS AND LESSONS LEARNED
| GDC19 | AMD Ryzen™ Processor Software Optimization | March 20th 2019
AGENDA
• Use general guidance
• Use best practices when counting cores
• Build command lists in parallel
• Use best practices with spinlocks
• Avoid memcpy & memset regression
• Avoid too many non-temporal streams
• Avoid false sharing
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors
Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & Memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance.
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 9, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes.
[Chart: LAME 3.100 WAV to MP3 Encode, elapsed time (s) by configuration and platform, less is better]
Configuration | Platform | Elapsed time (s) | Gain over Release x86
Release | x86 | 52 |
ReleaseSSE2 | x86 | 44 | +18%
Release | x64 | 43 | +21%
ReleaseSSE2 | x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
  • The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
  • Windows 7 x86 does NOT require SSE2 & PrefetchW.
WinDbg output for an unsupported instruction:
    0:000> g
    (79c.838): Illegal instruction - code c000001d (first chance)
    (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
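A minimal sketch of the check this slide asks for, assuming either MSVC `<intrin.h>` or GCC/Clang `<cpuid.h>`. The names `CpuidRegs`, `queryCpuid`, `hasSsse3`, `hasFma4`, and `selectAbsPath` are illustrative and do not come from the projects listed above. The bit positions used are the architectural ones: SSSE3 is CPUID leaf 1, ECX bit 9, and FMA4 is leaf 0x80000001, ECX bit 16.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

struct CpuidRegs {
    std::uint32_t eax, ebx, ecx, edx;
};

// Query one CPUID leaf. Returns all zeros when the leaf (or CPUID itself)
// is unavailable, so every feature decoded from it reads as "absent".
inline CpuidRegs queryCpuid(std::uint32_t leaf) {
    CpuidRegs r = {0, 0, 0, 0};
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, static_cast<int>(leaf & 0x80000000u));  // highest leaf in this range
    if (static_cast<std::uint32_t>(regs[0]) >= leaf) {
        __cpuid(regs, static_cast<int>(leaf));
        r.eax = static_cast<std::uint32_t>(regs[0]);
        r.ebx = static_cast<std::uint32_t>(regs[1]);
        r.ecx = static_cast<std::uint32_t>(regs[2]);
        r.edx = static_cast<std::uint32_t>(regs[3]);
    }
#elif defined(__x86_64__) || defined(__i386__)
    unsigned int a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(leaf, &a, &b, &c, &d)) {  // returns 0 if the leaf is unsupported
        r.eax = a; r.ebx = b; r.ecx = c; r.edx = d;
    }
#else
    (void)leaf;  // not an x86 build: report no features
#endif
    return r;
}

inline bool bitSet(std::uint32_t reg, unsigned bit) {
    assert(bit < 32);
    return ((reg >> bit) & 1u) != 0;
}

// SSSE3: leaf 1, ECX bit 9.
inline bool hasSsse3(const CpuidRegs& leaf1) { return bitSet(leaf1.ecx, 9); }

// FMA4: leaf 0x80000001, ECX bit 16.
inline bool hasFma4(const CpuidRegs& extLeaf1) { return bitSet(extLeaf1.ecx, 16); }

// Choose a code path once at startup instead of assuming pabsd exists.
inline const char* selectAbsPath() {
    return hasSsse3(queryCpuid(1)) ? "ssse3" : "sse2-fallback";
}
```

The decoders take the register values as input, so the dispatch logic can be exercised with synthetic values on any machine. AVX and AVX512 paths additionally need an operating-system state check via XGETBV, which this sketch leaves out.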
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not a SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            }
            else {
                count = cores;
            }
        }
        return count;
    }
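The listing above calls a `getCpuidFamily()` helper that the slide does not show. The sketch below is one way such a helper could decode the family from the EAX value of CPUID leaf 1, together with the same thread-count policy written as a pure function. `decodeCpuidFamily` and `defaultThreadCount` are hypothetical names, and the EAX values in the usage note are example signatures only.

```cpp
#include <cassert>
#include <cstdint>

// Decode the display family from CPUID leaf 1 EAX.
// The base family is bits 11:8; when it equals 0xF, the extended
// family (bits 27:20) is added to it.
inline std::uint32_t decodeCpuidFamily(std::uint32_t eax) {
    const std::uint32_t baseFamily = (eax >> 8) & 0xFu;
    const std::uint32_t extFamily = (eax >> 20) & 0xFFu;
    return (baseFamily == 0xFu) ? baseFamily + extFamily : baseFamily;
}

// Policy from the slide as a pure function: non-AMD parts and AMD family 15h
// ("Bulldozer", not an SMT design) use every logical processor; other AMD
// families use one thread per physical core.
inline std::uint32_t defaultThreadCount(bool isAmd, std::uint32_t family,
                                        std::uint32_t cores, std::uint32_t logical) {
    assert(cores >= 1 && logical >= cores);
    if (!isAmd) {
        return logical;
    }
    return (family == 0x15u) ? logical : cores;
}
```

For example, an EAX value of 0x00800F82 decodes to base family 0xF plus extended family 0x08, that is family 17h, so an 8-core, 16-thread part would get 8 threads under this policy. As the slide notes, profiling should still decide the final count.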
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • Number of cores per processor & Number of threads per processor return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                     // logical processors
    DWORD_PTR p = 0xffffffffffffffff;   // process mask
    int b = 32;                         // bit index
    #if 0 // BAD
    int x = p;                          // signed narrowing
    int count = 0;
    // while (x != 0) {                 // never zero
    while (x > 0) {                     // false
        count += (x & 1);
        x >>= 1;                        // fill bits with sign
    }
    printf("popcnt = %i\n", count);                     // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));    // undefined 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
    #endif
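The same rule can be sketched in portable C++ by keeping the mask in an unsigned 64-bit type from start to finish. `countProcessors` and `processorBit` are illustrative names; on Windows the mask would normally come from an API such as GetProcessAffinityMask, which is not called here.

```cpp
#include <cassert>
#include <cstdint>

// Count the set bits of a 64-bit affinity mask without narrowing it.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;  // unsigned shift: vacated bits are filled with 0
    }
    return count;
}

// Build the mask bit for one logical processor index (0..63).
inline std::uint64_t processorBit(unsigned index) {
    assert(index < 64);
    return std::uint64_t{1} << index;  // 64-bit operand, so bit 32 and above are representable
}
```

With a full mask of 0xffffffffffffffff, `countProcessors` returns 64 and `processorBit(32)` yields 0x0000000100000000, matching the GOOD branch above.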
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];             // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // The first call fails and reports the required length in len.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // walk the returned records here
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
  • The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: D3D12Multithreading, TitleThrottle=10000, higher is better. Y axis: cpu time % difference from NumContexts 1 (-50 to 400). X axis: NumContexts (0 to 16). Series: Draws 1025, Draws 10250]
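The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a sketch: `recommendedNumContexts` is an illustrative name, and the lower clamp to 1 is an assumption added here (the slide does not state one) so that small scenes and single-core systems still get one context.

```cpp
#include <algorithm>
#include <cassert>

// NumContexts = min(cores - 1, Draws / 250), clamped to at least 1.
// The clamp is an added assumption, not part of the slide's formula.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned draws) {
    assert(physicalCores >= 1);
    const unsigned byCores = physicalCores - 1;  // leave one core for the main thread
    const unsigned byDraws = draws / 250;        // integer division
    return std::max(1u, std::min(byCores, byDraws));
}
```

For the two series in the chart on an 8-core processor, this gives 4 contexts for 1025 draws (1025/250 = 4) and 7 contexts for 10250 draws (limited by cores-1).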
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering
Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock xchg cmp
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
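The same four practices can be sketched with standard C++11 atomics for readers not using the MSVC intrinsics. This is not the slide's code: `SpinLock` and `cpuRelax` are illustrative names, `std::atomic::exchange` stands in for `_InterlockedExchange`, and unlock uses a release store rather than an exchange.

```cpp
#include <atomic>
#include <cassert>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#endif

// Hint to the core that this is a spin-wait loop (a no-op on non-x86 targets).
inline void cpuRelax() {
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
    _mm_pause();
#endif
}

// alignas(64) gives the lock its own cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    // Test with a plain load first; only attempt the exchange
    // (test-and-set) when the lock looks free.
    bool tryLock() {
        return state.load(std::memory_order_relaxed) == 0 &&
               state.exchange(1, std::memory_order_acquire) == 0;
    }

    void lock() {
        while (!tryLock()) {
            cpuRelax();
        }
    }

    void unlock() {
        state.store(0, std::memory_order_release);
    }
};

static_assert(alignof(SpinLock) == 64, "lock should start on a cache-line boundary");
static_assert(sizeof(SpinLock) == 64, "lock should occupy a full cache line");
```

Typical use is `lock.lock(); /* critical section */ lock.unlock();`. While the lock is held, waiting threads mostly perform read-only loads, so the cache line is not repeatedly pulled away from the owner.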
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks, elapsed time (s) per binary, less is better. Before: 304, After: 27 (+1018%)]
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
[Figure: SOURCE BEFORE]
[Figure: SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0:
      cmp dword ptr [?gLock@@3IA],1
      je 00000001400010D9
      mov eax,1
      xchg eax,dword ptr [?gLock@@3IA]
      cmp eax,1
      jne 00000001400010DD
    00000001400010D9:
      pause
      jmp 00000001400010C0
    00000001400010DD:
BEFORE (BAD)
    00000001400010C1:
      xor eax,eax
      lock cmpxchg dword ptr [?gLock@@3IA],ecx
      cmp eax,ecx
      je 00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic | generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg
_InterlockedExchangePointer | xchg
Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\” to the application folder.

memcpy.asm line 456:
    XmmCopySmall:
        bt __favor, __FAVOR_SMSTRG      ; check if string copy should be used
        jc memcpy_repmovs
        movups xmm0, [rcx + rdx]        ; load deferred bytes
        add rcx, 16
        sub r8, 16
memset.asm line 93:
        ; Check if strings should be used
        cmp r8, 128                     ; is this a small set, size <= 128?
        ja XmmSet                       ; if large set, use XMM set
        bt __favor, __FAVOR_SMSTRG      ; check if string set should be used
        jnc XmmSetSmall                 ; otherwise, use a 16-byte block set
        jmp memset_repmovs
memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark, less is better. Y axis: Elapsed Time (s), 0 to 35. X axis: size, 0 to 160. Series: before, after]
CODE SAMPLE
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand, EXIT_SUCCESS
    #include <cstring>   // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);   // size is unknown at compile time
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Stop before sizeof(a): "<=" would write one element past the array.
        for (int i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
      mov r11,rcx
      mov r10,rdx
      cmp r8,10h
      jbe 00000001400012F0
      cmp r8,20h
      jbe 00000001400012D0
      sub rdx,rcx
      jae 00000001400012A4
      lea rax,[r8+r10]
      cmp rcx,rax
      jb 00000001400015D0
      cmp r8,80h
      jbe 0000000140001510
      …
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
      jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams, elapsed time (s) per binary, less is better. Before: 104, After: 47 (+123%)]
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
[Figure: SOURCE BEFORE]
[Figure: SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
      mulps xmm0,xmm1
      addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
      movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
      movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
      mulps xmm0,xmm1
      addps xmm0,xmmword ptr [rsi+r8*4+4640h]
      movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
      mulps xmm0,xmm1
      addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
      movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
      movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
      mulps xmm0,xmm1
      addps xmm0,xmmword ptr [rsi+r8*4+4640h]
      movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad Thread parameters – especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
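A small sketch of the "align and pad" advice, assuming a 64-byte cache line as used throughout this deck. `PackedSum`, `PaddedSum`, `kCacheLine`, and `sameCacheLine` are illustrative names. The point is that `alignas(64)` also rounds `sizeof` up to 64, so neighbouring array elements no longer share a line.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Packed: adjacent array elements land in the same cache line.
struct PackedSum {
    unsigned long sum;
};

// Padded: each element starts on its own line and fills it.
struct alignas(kCacheLine) PaddedSum {
    unsigned long sum;
};

static_assert(sizeof(PaddedSum) == kCacheLine, "one element per cache line");
static_assert(alignof(PaddedSum) == kCacheLine, "elements start on a line boundary");

// True when two addresses fall inside the same 64-byte line.
inline bool sameCacheLine(const void* a, const void* b) {
    const std::uintptr_t lineA = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    const std::uintptr_t lineB = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return lineA == lineB;
}
```

Giving each thread one `PaddedSum` element means its writes stay within its own line. Note that allocating over-aligned types with `new[]` is only guaranteed to honour the alignment from C++17 onward; a static or stack array avoids that question.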
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing, time (s) per binary. Before: 365, After: 47 (+679%)]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms) %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
PROFILING AFTER
Few refills from CCX
[Profile table columns, both views: CPU clocks | Ret inst | IPC | Data Cache Accesses PTI | Demand Data Cache Miss | Data Cache Miss | Data Cache Refills DRAM PTI | Data Cache Refills CCX PTI | Data Cache Refills L2 PTI]
[Figure: SOURCE BEFORE]
[Figure: SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
      mov qword ptr [rsp+28h],rsi
      lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
      mov r9,rbp
      mov dword ptr [rsp+20h],esi
      xor edx,edx
      xor ecx,ecx
      call qword ptr [__imp_CreateThread]
      mov qword ptr [r15+rdi*8],rax
      add rbp,40h
      inc rdi
      cmp rdi,r14
      jb 0000000140001180
BEFORE (BAD)
    0000000140001180:
      mov qword ptr [rsp+28h],rsi
      lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
      mov r9,rbp
      mov dword ptr [rsp+20h],esi
      xor edx,edx
      xor ecx,ecx
      call qword ptr [__imp_CreateThread]
      mov qword ptr [r12+rdi*8],rax
      add rbp,4
      inc rdi
      cmp rdi,r14
      jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com @kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201931
bull Use general guidancebull Use best practices when counting coresbull Build commands lists in parallelbull Use best practices with spinlocksbull Avoid memcpy amp memset regressionbull Avoid too many non-temporal streamsbull Avoid false sharing
AGENDA
USE GENERAL GUIDANCE
USE THE LATEST VISUAL STUDIO COMPILER
Generate better code for current and past processors.

Year | Visual Studio Changes | AMD Product Family (implicit)
2019 | Additional SIMD intrinsics optimizations, including constant-folding and arithmetic simplifications. Build throughput improvements. New -Ob3 inlining option. Memcpy & memset optimizations. | “Pinnacle Ridge”
2017 | Improved code generation of loops: support for automatic vectorization of division of constant integers, better identification of memset patterns. Added CMake support. Added faster database engine. Improved STL & .NET optimizations. New /Qspectre option. | “Summit Ridge”
2015 | Improved autovectorization & scalar optimizations. Faster build times with /LTCG:incremental. Added assembly optimized memset & memcpy using ERMS & SSE2. | “Kaveri”, “Godavari”
2013 | Improved inline. Improved auto-vectorization. Improved ISO C99 language and library. | “Vishera”
2012 | Added autovectorization. Optimized container memory sizes. | “Bulldozer”
2010 | Added nullptr keyword. Replaced VCBuild with MSBuild. |
2008 | Tuned for Intel Core microarchitecture. Improved cpuidex & intrinsics. Added /Qfast_transcendentals & STL/CLR library. Faster build times with /MP & Managed incremental builds. | “Greyhound”
2005 | Added x64 native compiler. | “K8”
COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance.
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file 448MB 444 minutes

[Chart: LAME 3.100 WAV to MP3 Encode, elapsed time in seconds by configuration and platform (less is better)]
Configuration, Platform | Elapsed Time (s) | Speedup vs. Release x86
Release x86             | 52               |
ReleaseSSE2 x86         | 44               | +18%
Release x64             | 43               | +21%
ReleaseSSE2 x64         | 39               | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
  • The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

Debugger output:
0:000> g
(79c8.38): Illegal instruction - code c000001d (first chance)
(79c8.38): Illegal instruction - code c000001d (!!! second chance !!!)
mod!foobar+0xb63:
00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
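
The guidance on this slide can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not code from the presentation: `CpuFeatures`, `decodeFeatures`, and `queryCpuFeatures` are hypothetical names. The bit positions used (SSE2 in CPUID Fn0000_0001 EDX bit 26, SSSE3 in ECX bit 9, PrefetchW in CPUID Fn8000_0001 ECX bit 8, FMA4 in ECX bit 16) follow the published x86 CPUID conventions. AVX and AVX-512 additionally need an operating-system state check through XGETBV, which is left out here.

```cpp
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>   // __cpuid
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// Hypothetical container for the features discussed on the slide.
struct CpuFeatures {
    bool sse2 = false;
    bool ssse3 = false;
    bool prefetchw = false;
    bool fma4 = false;
};

// Pure decoding step, kept separate so it can be checked without real hardware.
inline CpuFeatures decodeFeatures(std::uint32_t leaf1Ecx,
                                  std::uint32_t leaf1Edx,
                                  std::uint32_t extLeaf1Ecx) {
    CpuFeatures f;
    f.sse2      = ((leaf1Edx    >> 26) & 1u) != 0;
    f.ssse3     = ((leaf1Ecx    >>  9) & 1u) != 0;
    f.prefetchw = ((extLeaf1Ecx >>  8) & 1u) != 0;
    f.fma4      = ((extLeaf1Ecx >> 16) & 1u) != 0;
    return f;
}

// Reads CPUID on x86 builds; on other targets every feature stays false.
inline CpuFeatures queryCpuFeatures() {
    std::uint32_t ecx1 = 0, edx1 = 0, ecxExt = 0;
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int r[4] = {0, 0, 0, 0};
    __cpuid(r, 1);
    ecx1 = static_cast<std::uint32_t>(r[2]);
    edx1 = static_cast<std::uint32_t>(r[3]);
    __cpuid(r, static_cast<int>(0x80000000u));
    if (static_cast<std::uint32_t>(r[0]) >= 0x80000001u) {
        __cpuid(r, static_cast<int>(0x80000001u));
        ecxExt = static_cast<std::uint32_t>(r[2]);
    }
#elif defined(__x86_64__) || defined(__i386__)
    unsigned a = 0, b = 0, c = 0, d = 0;
    if (__get_cpuid(1u, &a, &b, &c, &d)) { ecx1 = c; edx1 = d; }
    if (__get_cpuid(0x80000001u, &a, &b, &c, &d)) { ecxExt = c; }
#endif
    return decodeFeatures(ecx1, edx1, ecxExt);
}
```

A caller would test the relevant flag once at startup, for example `if (queryCpuFeatures().ssse3)`, and select a fallback code path otherwise, rather than executing `pabsd` unconditionally.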
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

// This advice is specific to AMD processors and is not general guidance for all processor vendors
DWORD getDefaultThreadCount() {
    DWORD cores, logical;
    getProcessorCount(cores, logical);
    DWORD count = logical;
    char vendor[13];
    getCpuidVendor(vendor);
    if (0 == strcmp(vendor, "AuthenticAMD")) {
        if (0x15 == getCpuidFamily()) {
            // AMD "Bulldozer" family microarchitecture
            count = logical;
        }
        else {
            count = cores;
        }
    }
    return count;
}
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • “Number of cores per processor” & “Number of threads per processor” return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
Physical Processor ID 0 has 16 cores
as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors: 16
Number of active physical processors: 1
Number of cores per processor: 16
Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
Number of cores per processor: 8
Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

DWORD lps = 64;                     // logical processors
DWORD_PTR p = 0xffffffffffffffff;   // process mask
int b = 32;                         // bit index
#if 0 // BAD
    int x = p;                      // signed narrowing
    int count = 0;
    //while (x != 0) {              // never zero
    while (x > 0) {                 // false
        count += (x & 1);
        x >>= 1;                    // fill bits with sign
    }
    printf("popcnt = %i\n", count); // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b)); // undefined 0
#else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b)); // 0x0000000100000000
#endif
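
The same idea can be shown without any Windows types. The sketch below is an illustration only; `countLogicalProcessors` and `processorBit` are hypothetical helper names, and `std::uint64_t` stands in for a 64-bit `DWORD_PTR` affinity mask. It keeps the mask unsigned and 64 bits wide for its whole lifetime, so right shifts fill with zeros and left shifts never touch a sign bit.

```cpp
#include <cassert>
#include <cstdint>

// Count the set bits of a 64-bit affinity mask without narrowing it to int.
inline int countLogicalProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;  // unsigned shift: vacated bits are zero-filled
    }
    return count;
}

// Build the mask for one logical processor index (0..63) within a group.
inline std::uint64_t processorBit(unsigned index) {
    assert(index < 64);
    return std::uint64_t{1} << index;  // 64-bit unsigned operand, no sign bit involved
}
```

With a full mask of 64 processors the count is 64, and `processorBit(32)` yields `0x0000000100000000`, the value the slide's GOOD branch prints.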
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

//char buffer[0x1000];  // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
            // ...
        }
        free(buffer);
    }
}
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL

BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
  • Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: D3D12Multithreading, TitleThrottle=10000. X axis: NumContexts (0 to 16). Y axis: cpu time difference from NumContexts 1 (higher is better). Series: Draws 1025 and Draws 10250.]
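
As a small worked example, the recommendation on this slide can be written as a helper, reading it as NumContexts = min(cores - 1, Draws / 250). The function name `recommendedNumContexts` is hypothetical, and the lower clamp to one context is an added assumption so that very small draw counts still get a worker; the slide itself does not state one.

```cpp
#include <algorithm>

// Sketch of the slide's heuristic: one context per physical core except one,
// but no more contexts than there are batches of roughly 250 draws.
inline unsigned recommendedNumContexts(unsigned physicalCores, unsigned drawCount) {
    const unsigned byCores = physicalCores > 1 ? physicalCores - 1 : 1;
    const unsigned byDraws = drawCount / 250;
    // Assumption: never return fewer than one context.
    return std::max(1u, std::min(byCores, byDraws));
}
```

For the 8-core test system, the two draw counts in the chart give 4 contexts for 1025 draws and 7 contexts for 10250 draws under this reading.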
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                       | Recommended Value
r.rhicmdusedeferredcontexts   | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable            | 1
r.rhicmdbypass                | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                   pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock xchg cmp
        }
    }
#else // GOOD
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
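
For readers not using the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant, not the presenter's code: `SpinLock`, `SPIN_PAUSE`, and `countWithLock` are hypothetical names, the unlock uses a release store instead of the slide's `_InterlockedExchange`, and on non-x86 targets the pause falls back to `std::this_thread::yield()`.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// The lock word sits alone on a 64-byte line, as the slide recommends.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test with a plain load first; only attempt the atomic exchange
            // (test-and-set) when the lock looks free.
            if (state.load(std::memory_order_relaxed) == 0 &&
                state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
            SPIN_PAUSE();
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};

// Small demonstration: several threads increment one counter under the lock.
inline unsigned long long countWithLock(unsigned numThreads, unsigned itersPerThread) {
    SpinLock lock;
    unsigned long long counter = 0;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&] {
            for (unsigned i = 0; i < itersPerThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

The read-before-exchange step is the design point: waiting threads spin on a shared, read-only copy of the cache line instead of repeatedly issuing atomic read-modify-write operations against it.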
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Use Best Practices With Spinlocks, elapsed time in seconds by binary (less is better)]
Binary | Time (s)
Before | 304
After  | 27 (+1018%)
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <windows.h>
#include <algorithm>   // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
00000001400010D9:
    pause
    jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                          | generates instructions including
_InterlockedAnd                    | lock cmpxchg
_interlockedbittestandreset        | lock btr
_interlockedbittestandset          | lock bts
_InterlockedCompareExchange        | lock cmpxchg
_InterlockedCompareExchange128     | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement              | lock dec
_InterlockedExchangeAdd            | lock add
_InterlockedIncrement              | lock inc
_InterlockedOr                     | lock cmpxchg
_InterlockedXor                    | lock cmpxchg
_InterlockedExchange               | xchg
_InterlockedExchangePointer        | xchg

Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently, “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder.

; memcpy.asm line 456
XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

; memset.asm line 93
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark. X axis: size (0 to 160). Y axis: Elapsed Time (s), 0 to 35 (less is better). Series: before, after.]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>    // printf
#include <cstdlib>   // atoi, atoll, rand, srand
#include <cstring>   // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // Loop bound is i < sizeof(a); the slide's i <= sizeof(a) writes one byte past the array.
    for (size_t i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid Too Many Non-Temporal Streams, elapsed time in seconds by binary (less is better)]
Binary | Time (s)
Before | 104
After  | 47 (+123%)
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <stdio.h>    // printf
#include <stdlib.h>   // atoi, atoll, atof, rand, srand
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad Thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
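
A portable sketch of the "align and pad" advice, using `std::thread` rather than the Windows API, is shown below. It is an illustration only: `PaddedCounter`, `kCacheLine`, and `countPerThread` are hypothetical names, the 64-byte line size is an assumption matching the slide's `alignas(64)`, and C++17 is assumed so that `std::vector` honors the over-aligned type when allocating.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Each counter occupies its own cache line, so threads updating neighbouring
// slots do not invalidate each other's lines.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long long value = 0;
};

// Every thread increments only its own slot; results are read after join().
inline std::vector<unsigned long long> countPerThread(unsigned numThreads,
                                                      unsigned long long iters) {
    std::vector<PaddedCounter> slots(numThreads);
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&slots, t, iters] {
            for (unsigned long long i = 0; i < iters; ++i) {
                slots[t].value += 1;
            }
        });
    }
    for (auto& w : workers) w.join();

    std::vector<unsigned long long> totals;
    for (const auto& s : slots) totals.push_back(s.value);
    return totals;
}
```

Removing `alignas(kCacheLine)` packs eight of these counters into one 64-byte line, which is the layout the BEFORE case on these slides measures.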
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid False Sharing, time in seconds by binary]
Binary | Time (s)
Before | 365
After  | 47 (+679%)
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <stdio.h>    // printf
#include <stdlib.h>   // atoi, rand, srand
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms) %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX

PROFILING AFTER
Few refills from CCX

Profile columns shown on both slides: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI.

DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180

BEFORE (BAD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
USE GENERAL GUIDANCE
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201933
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year Visual Studio Changes AMD Product Family (implicit)2019 Additional SIMD instrinsics optimizations including constant-folding and arithmetic simplifications
Build throughput improvements New -Ob3 inlining option Memcpy amp Memset optimizationsldquoPinnacle Ridgerdquo
2017 Improved code generation of loops Support for automatic vectorization of division of constant integers better identification of memset patterns Added Cmake support Added faster database engine Improved STL amp NET optimizations New Qspectre option
ldquoSummit Ridgerdquo
2015 Improved autovectorization amp scalar optimizations Faster build times with LTCGincremental Added assembly optimized memset amp memcpy using ERMS amp SSE2
ldquoKaverirdquo ldquoGodavarirdquo
2013 Improved inline Improved auto-vectorization Improved ISO C99 language and library ldquoVisherardquo
2012 Added autovectorization Optimized container memory sizes ldquoBulldozerrdquo
2010 Added nullptr keyword Replaced VCBuild with MSBuild
2008 Tuned for Intel Core microarchitecture Improved cpuidex amp intrinsics Added Qfast_transcendentals amp STLCLR library Faster build times with MP amp Managed incremental builds
ldquoGreyhoundrdquo
2005 Added x64 native compiler ldquoK8rdquo
COMPILE FOR PLATFORM X64
• Binary compiled after applying platform x64 shows higher performance.
  • Configuration ReleaseSSE2 uses SSE2 intrinsics for some functions.
  • See http://lame.sourceforge.net
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 9, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
• WAV file: 448MB, 44.4 minutes.

Chart: LAME 3.100 WAV to MP3 Encode, elapsed time in seconds (less is better)
Configuration | Platform | Seconds | Change vs. Release x86
Release | x86 | 52 | baseline
ReleaseSSE2 | x86 | 44 | +18%
Release | x64 | 43 | +21%
ReleaseSSE2 | x64 | 39 | +33%
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD “Pinnacle Ridge” processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD “Greyhound” & “Llano” processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
  • The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
  • Windows 7 x86 does NOT require SSE2 & PrefetchW.

Debugger output:
    0:000> g
    (79c.838): Illegal instruction - code c000001d (first chance)
    (79c.838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
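A minimal sketch of the check this slide asks for, assuming the standard CPUID feature bit positions (SSSE3 in Fn0000_0001 ECX bit 9, AVX512F in Fn0000_0007 EBX bit 16, FMA4 in Fn8000_0001 ECX bit 16). The names `CpuidRegs`, `queryCpuid`, `decodeFeatures` and `detectFeatures` are hypothetical helpers, not part of the deck. AVX-class instructions additionally need an operating system state check through XGETBV, which is not shown here.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#endif

struct CpuidRegs { uint32_t eax, ebx, ecx, edx; };
struct CpuFeatures { bool ssse3, fma4, avx512f; };

// Executes CPUID for one leaf/subleaf; returns zeros on non-x86 builds.
inline CpuidRegs queryCpuid(uint32_t leaf, uint32_t subleaf) {
    CpuidRegs r = {0, 0, 0, 0};
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
    int regs[4];
    __cpuidex(regs, static_cast<int>(leaf), static_cast<int>(subleaf));
    r.eax = static_cast<uint32_t>(regs[0]);
    r.ebx = static_cast<uint32_t>(regs[1]);
    r.ecx = static_cast<uint32_t>(regs[2]);
    r.edx = static_cast<uint32_t>(regs[3]);
#elif defined(__x86_64__) || defined(__i386__)
    __cpuid_count(leaf, subleaf, r.eax, r.ebx, r.ecx, r.edx);
#endif
    return r;
}

inline bool testBit(uint32_t value, unsigned bit) {
    assert(bit < 32);
    return ((value >> bit) & 1u) != 0;
}

// Pure decoding step, kept separate so it can be checked without hardware.
inline CpuFeatures decodeFeatures(const CpuidRegs& leaf1,
                                  const CpuidRegs& leaf7,
                                  const CpuidRegs& ext1) {
    CpuFeatures f;
    f.ssse3   = testBit(leaf1.ecx, 9);   // Fn0000_0001 ECX[9]
    f.avx512f = testBit(leaf7.ebx, 16);  // Fn0000_0007 (subleaf 0) EBX[16]
    f.fma4    = testBit(ext1.ecx, 16);   // Fn8000_0001 ECX[16]
    return f;
}

// Queries only the leaves the processor reports as supported.
inline CpuFeatures detectFeatures() {
    const CpuidRegs zero = {0, 0, 0, 0};
    const uint32_t maxBasic = queryCpuid(0u, 0u).eax;
    const uint32_t maxExt = queryCpuid(0x80000000u, 0u).eax;
    const CpuidRegs leaf1 = (maxBasic >= 1u) ? queryCpuid(1u, 0u) : zero;
    const CpuidRegs leaf7 = (maxBasic >= 7u) ? queryCpuid(7u, 0u) : zero;
    const CpuidRegs ext1 =
        (maxExt >= 0x80000001u) ? queryCpuid(0x80000001u, 0u) : zero;
    return decodeFeatures(leaf1, leaf7, ext1);
}
```

A caller would evaluate `detectFeatures()` once at startup and select, for example, an SSE2 code path whenever `ssse3` is false, rather than letting a `pabsd` fault at run time.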
USE BEST PRACTICES WHEN COUNTING CORES
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits, and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            }
            else {
                count = cores;
            }
        }
        return count;
    }
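The slide calls `getCpuidVendor` and `getCpuidFamily` without showing them. The sketch below illustrates only the decoding such helpers would need, assuming the standard CPUID layout: the vendor string comes from Fn0000_0000 in EBX, EDX, ECX order, and the family is BaseFamily (EAX bits 11:8) plus ExtendedFamily (bits 27:20) when BaseFamily is 0xF. The function names and the example register values are illustrative assumptions, not code from the deck or from the GPUOpen sample.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Decodes the CPUID family from Fn0000_0001 EAX.
inline uint32_t decodeCpuidFamily(uint32_t eax) {
    const uint32_t base = (eax >> 8) & 0xFu;
    const uint32_t ext = (eax >> 20) & 0xFFu;
    return (base == 0xFu) ? base + ext : base;
}

// Rebuilds the 12-character vendor string (little-endian host assumed).
inline void decodeCpuidVendor(uint32_t ebx, uint32_t ecx, uint32_t edx,
                              char out[13]) {
    std::memcpy(out + 0, &ebx, 4);
    std::memcpy(out + 4, &edx, 4);
    std::memcpy(out + 8, &ecx, 4);
    out[12] = '\0';
}

// Same policy as the slide: physical cores on AMD, except family 15h
// ("Bulldozer"), which is not an SMT design; logical processors otherwise.
inline uint32_t chooseThreadCount(const char* vendor, uint32_t family,
                                  uint32_t cores, uint32_t logical) {
    assert(cores >= 1 && logical >= cores);
    if (std::strcmp(vendor, "AuthenticAMD") == 0 && family != 0x15u) {
        return cores;
    }
    return logical;
}
```

Keeping the decoding separate from the CPUID instruction itself makes the policy easy to unit test with recorded register values; the result is still only a default, and profiling should decide the final thread count.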
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • Number of cores per processor & Number of threads per processor return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise, the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                    // logical processors
    DWORD_PTR p = 0xffffffffffffffff;  // process mask
    int b = 32;                        // bit index

    #if 0 // BAD
    int x = p;                         // signed narrowing
    int count = 0;
    // while (x != 0)                  // never zero
    while (x > 0)                      // false
    {
        count += (x & 1);
        x >>= 1;                       // fill bits with sign
    }
    printf("popcnt = %i\n", count);    // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0)
    {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
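The GOOD branch above can be sketched portably with a fixed-width unsigned type, so the mask never passes through a signed or 32-bit value. `countProcessors` and `processorBit` are hypothetical names, and `uint64_t` stands in for a 64-bit `DWORD_PTR`.

```cpp
#include <cassert>
#include <cstdint>

// Counts set bits in a 64-bit affinity mask. The shift is unsigned, so
// vacated positions fill with zero and the loop always terminates.
inline int countProcessors(uint64_t affinityMask) {
    int count = 0;
    while (affinityMask != 0) {
        count += static_cast<int>(affinityMask & 1u);
        affinityMask >>= 1;
    }
    return count;
}

// Builds the mask bit for one logical processor. The 64-bit literal
// keeps the shift defined for indices 32 to 63.
inline uint64_t processorBit(unsigned index) {
    assert(index < 64);
    return uint64_t{1} << index;
}
```

With a full mask, `countProcessors(0xFFFFFFFFFFFFFFFF)` counts all 64 processors of the group, whereas the narrowed `int` version on the slide reports 0.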
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise, the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];  // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
            &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: D3D12Multithreading, TitleThrottle=10000. Vertical axis: cpu time difference from NumContexts 1 (higher is better). Horizontal axis: NumContexts, 0 to 16. Series: Draws 1025 and Draws 10250.
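The recommendation NumContexts = min(cores-1, Draws/250) can be sketched as a small helper. Two details are assumptions on top of the slide's formula: the division is integer division, and the result is clamped to at least 1 so that a single-core machine or a very small draw count still gets one context. `recommendedNumContexts` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>

// Returns min(cores - 1, draws / 250), clamped to a minimum of 1.
inline unsigned recommendedNumContexts(unsigned cores, unsigned draws) {
    assert(cores >= 1);
    const unsigned byCores = (cores > 1u) ? cores - 1u : 1u;
    const unsigned byDraws = draws / 250u;  // integer division (assumption)
    return std::max(1u, std::min(byCores, byDraws));
}
```

For the two draw counts charted above on an 8-core part, this gives 4 contexts for 1025 draws (limited by the draw count) and 7 contexts for 10250 draws (limited by the core count).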
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command | Recommended Value
r.rhicmdusedeferredcontexts | 1
r.rhicmduseparallelalgorithms | 1
r.rhithread.enable | 1
r.rhicmdbypass | 0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • Avoid lock prefix instructions.
  • Use the pause instruction.
  • Test and test-and-set.
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
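For code that does not use the MSVC interlocked intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is a swapped-in formulation rather than the deck's code: `SpinLock` and `SPIN_PAUSE` are hypothetical names, the release in `unlock` is a plain store instead of an exchange, and the 64-byte alignment assumes a 64-byte cache line.

```cpp
#include <atomic>
#include <cassert>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() ((void)0)  // no pause hint on other targets
#endif

// One lock per cache line, so spinning on it does not disturb neighbours.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: wait with plain loads and pause while the lock is taken.
            while (state.load(std::memory_order_relaxed) != 0) {
                SPIN_PAUSE();
            }
            // Test-and-set: a single exchange once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
        }
    }

    bool tryLock() {
        return state.load(std::memory_order_relaxed) == 0 &&
               state.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() {
        assert(state.load(std::memory_order_relaxed) == 1);
        state.store(0, std::memory_order_release);
    }
};

static_assert(alignof(SpinLock) == 64, "lock should start a cache line");
static_assert(sizeof(SpinLock) == 64, "lock should occupy a whole cache line");
```

The read-only inner loop keeps waiting threads from issuing an atomic read-modify-write on every iteration, which is the behaviour the BAD variant above exhibits.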
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Use Best Practices With Spinlocks, elapsed time in seconds by binary (less is better)
Binary | Time (s)
Before | 304
After | 27 (+1018%)
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    // namespace MyLock and gLock are defined as shown on the SUMMARY slide.

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions.

PROFILING AFTER
~0 stalls per 1000 instructions.

[Figure: SOURCE BEFORE]
[Figure: SOURCE AFTER]
DISASSEMBLY SNIPPET
BEFORE (BAD)
    00000001400010C1:
        xor eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp eax,ecx
        je 00000001400010C1

AFTER (GOOD)
    00000001400010C0:
        cmp dword ptr [?gLock@@3IA],1
        je 00000001400010D9
        mov eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp eax,1
        jne 00000001400010DD
    00000001400010D9:
        pause
        jmp 00000001400010C0
    00000001400010DD:
INTERLOCKED INTRINSIC FUNCTIONS
Intrinsic | Generates instructions including
_InterlockedAnd | lock cmpxchg
_interlockedbittestandreset | lock btr
_interlockedbittestandset | lock bts
_InterlockedCompareExchange | lock cmpxchg
_InterlockedCompareExchange128 | lock cmpxchg16b
_InterlockedCompareExchangePointer | lock cmpxchg
_InterlockedDecrement | lock dec
_InterlockedExchangeAdd | lock add
_InterlockedIncrement | lock inc
_InterlockedOr | lock cmpxchg
_InterlockedXor | lock cmpxchg
_InterlockedExchange | xchg (preferred: no lock prefix)
_InterlockedExchangePointer | xchg (preferred: no lock prefix)

Instructions generated may vary depending on compiler and optimization flags.

AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently, “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow.
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder.

    ; memcpy.asm line 456
    XmmCopySmall:
            bt      __favor, __FAVOR_SMSTRG  ; check if string copy should be used
            jc      memcpy_repmovs
            movups  xmm0, [rcx + rdx]        ; load deferred bytes
            add     rcx, 16
            sub     r8, 16

    ; memset.asm line 93
            ; Check if strings should be used
            cmp     r8, 128                  ; is this a small set, size <= 128?
            ja      XmmSet                   ; if large set, use XMM set
            bt      __favor, __FAVOR_SMSTRG  ; check if string set should be used
            jnc     XmmSetSmall              ; otherwise, use a 16-byte block set
            jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: call memcpy benchmark (less is better). Vertical axis: Elapsed Time (s), 0 to 35. Horizontal axis: size, 0 to 160. Series: before and after.
CODE SAMPLE
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide's loop condition was i <= sizeof(a), which writes one
        // element past the end of the array; i < sizeof(a) stays in bounds.
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
        jmp qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb 00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
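One possible way to keep a single non-temporal stream active per hardware thread is to split an interleaved loop into one pass per array, so each pass streams to only one destination. The sketch below assumes the layout used by the deck's sample (4 position floats followed by 4 velocity floats) and 16-byte aligned data. `stepOneArray` and `stepBoth` are hypothetical names, and this loop-splitting variant is an illustration rather than the workaround the deck measures (which replaces one streaming store with a regular store), so it should be profiled before adoption.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_SSE_STREAM 1
#else
#define HAVE_SSE_STREAM 0
#endif

// Advances positions by velocity * dt for one array, using a single
// non-temporal store stream for the whole pass.
inline void stepOneArray(float* data, std::size_t len, float dt) {
    assert(len % 8 == 0);
#if HAVE_SSE_STREAM
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < len; i += 8) {
        const __m128 p = _mm_load_ps(data + i);      // x, y, z, w
        const __m128 v = _mm_load_ps(data + i + 4);  // vx, vy, vz, vw
        _mm_stream_ps(data + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    _mm_sfence();  // make the streamed stores globally visible
#else
    // Scalar fallback for targets without SSE: ordinary stores.
    for (std::size_t i = 0; i < len; i += 8)
        for (std::size_t k = 0; k < 4; ++k)
            data[i + k] += data[i + 4 + k] * dt;
#endif
}

// Processes the arrays one after the other instead of interleaving them,
// so only one write-combining stream is open at a time on this thread.
inline void stepBoth(float* a, float* b, std::size_t len, float dt) {
    stepOneArray(a, len, dt);
    stepOneArray(b, len, dt);
}
```

Whether two passes beat a single pass with one regular store depends on array size and cache behaviour; the WcbFull PTI counter named above is the measurement to compare.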
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid Too Many Non-Temporal Streams, elapsed time in seconds by binary (less is better)
Binary | Time (s)
Before | 104
After | 47 (+123%)
CODE SAMPLE
    #include <intrin.h>
    #include <cstdio>
    #include <cstdlib>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions.

PROFILING AFTER
~3 WcbFull per 1000 instructions.

[Figure: SOURCE BEFORE]
[Figure: SOURCE AFTER]
DISASSEMBLY SNIPPET
BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0

AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad Thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
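The "align and pad" advice can be checked at compile time. The sketch below assumes a 64-byte cache line; `PaddedCounter`, `kCacheLine` and `sameCacheLine` are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Per-thread data padded and aligned so each instance owns a cache line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long sum = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should fill exactly one cache line");
static_assert(alignof(PaddedCounter) == kCacheLine,
              "each counter should start on a cache line boundary");

// Returns true when two addresses fall within the same cache line.
inline bool sameCacheLine(const void* p, const void* q) {
    assert(p != nullptr && q != nullptr);
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(q) / kCacheLine;
}
```

Adjacent elements of a `PaddedCounter` array never share a line, whereas adjacent `unsigned long` values usually do. Note that allocating over-aligned types with `new[]` only guarantees the alignment from C++17 onward; static or stack arrays honour it in earlier standards.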
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid False Sharing, time in seconds by binary
Binary | Time (s)
Before | 365
After | 47 (+679%)
CODE SAMPLE
    #include <windows.h>
    #include <cstdio>
    #include <cstdlib>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX.
[Figure: profiler table with columns CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX.
[Figure: profiler table with the same columns]

[Figure: SOURCE BEFORE]
[Figure: SOURCE AFTER]
DISASSEMBLY SNIPPET
BEFORE (BAD)
    0000000140001180:
        mov qword ptr [rsp+28h],rsi
        lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov r9,rbp
        mov dword ptr [rsp+20h],esi
        xor edx,edx
        xor ecx,ecx
        call qword ptr [__imp_CreateThread]
        mov qword ptr [r12+rdi*8],rax
        add rbp,4
        inc rdi
        cmp rdi,r14
        jb 0000000140001180

AFTER (GOOD)
    0000000140001180:
        mov qword ptr [rsp+28h],rsi
        lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov r9,rbp
        mov dword ptr [rsp+20h],esi
        xor edx,edx
        xor ecx,ecx
        call qword ptr [__imp_CreateThread]
        mov qword ptr [r15+rdi*8],rax
        add rbp,40h
        inc rdi
        cmp rdi,r14
        jb 0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New “Zen 2” core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION: © 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201933
Generate better code for current and past processors
USE THE LATEST VISUAL STUDIO COMPILER
Year Visual Studio Changes AMD Product Family (implicit)2019 Additional SIMD instrinsics optimizations including constant-folding and arithmetic simplifications
Build throughput improvements New -Ob3 inlining option Memcpy amp Memset optimizationsldquoPinnacle Ridgerdquo
2017 Improved code generation of loops Support for automatic vectorization of division of constant integers better identification of memset patterns Added Cmake support Added faster database engine Improved STL amp NET optimizations New Qspectre option
ldquoSummit Ridgerdquo
2015 Improved autovectorization amp scalar optimizations Faster build times with LTCGincremental Added assembly optimized memset amp memcpy using ERMS amp SSE2
ldquoKaverirdquo ldquoGodavarirdquo
2013 Improved inline Improved auto-vectorization Improved ISO C99 language and library ldquoVisherardquo
2012 Added autovectorization Optimized container memory sizes ldquoBulldozerrdquo
2010 Added nullptr keyword Replaced VCBuild with MSBuild
2008 Tuned for Intel Core microarchitecture Improved cpuidex amp intrinsics Added Qfast_transcendentals amp STLCLR library Faster build times with MP amp Managed incremental builds
ldquoGreyhoundrdquo
2005 Added x64 native compiler ldquoK8rdquo
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201934
bull Binary compiled after applying platformx64 shows higher performancebull ConfigurationReleaseSSE2 uses SSE2 intrinsics for
some functionsbull See httplamesourceforgenet
bull Performance of binary compiled with Microsoft Visual Studio 2017 v1596
bull Testing done by Kenneth Mitchell February 9 2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
bull WAV file 448MB 444 minutes
COMPILE FOR PLATFORMX64
Release x86
ReleaseSSE2
x86
Release x64
ReleaseSSE2
x64seconds 52 44 43 39
+18 +21 +33
0102030405060
Elap
sed
Tim
e (s
)
Configuration Platform
LAME 3100WAV to MP3 Encode
(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019

TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
  • Do not use FMA4 instructions if they are not enumerated by CPUID
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
  • ISPC https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3 https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
• The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW
• Windows 7 x86 does NOT require SSE2 & PrefetchW

    0:000> g
    (79c8.38): Illegal instruction - code c000001d (first chance)
    (79c8.38): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4 pabsd xmm2,xmm4
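
The rule on this slide can be sketched as a runtime check that gates an SSSE3 code path. This is a minimal sketch, not code from the talk: the names `cpuHasSsse3`, `selectAbsPath`, and `absScalar` are hypothetical, and it assumes an x86 target built with MSVC, GCC, or Clang (on any other target the check simply reports false). SSSE3 support is reported in CPUID function 1, ECX bit 9.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_IX86))
#include <intrin.h>
#define HAVE_MSVC_CPUID 1
#elif defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>
#define HAVE_GNU_CPUID 1
#endif

// Returns true when CPUID function 1 reports SSSE3 (ECX bit 9).
bool cpuHasSsse3() {
#if defined(HAVE_MSVC_CPUID)
    int regs[4] = {0, 0, 0, 0};
    __cpuid(regs, 0);                 // regs[0] = highest supported function
    if (regs[0] < 1) return false;
    __cpuid(regs, 1);
    return ((static_cast<unsigned>(regs[2]) >> 9) & 1u) != 0;
#elif defined(HAVE_GNU_CPUID)
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx)) return false;
    return ((ecx >> 9) & 1u) != 0;
#else
    return false;                     // assumption: non-x86 target, no SSSE3
#endif
}

enum class AbsPath { Scalar, Ssse3 };

// Pick the code path once at startup; never call the SSSE3 path
// (for example one that uses pabsd) unless CPUID enumerates it.
AbsPath selectAbsPath() {
    return cpuHasSsse3() ? AbsPath::Ssse3 : AbsPath::Scalar;
}

// Scalar fallback that is safe on every processor.
void absScalar(const std::int32_t* in, std::int32_t* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] < 0 ? -in[i] : in[i];
    }
}
```

Selecting the path once and storing the result keeps the CPUID cost out of hot loops; the scalar path remains the default whenever the feature bit is absent.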

USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors
• Generally, applications show SMT benefits, and use of all logical processors is recommended
  • But games often suffer from SMT contention on the main thread
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count
• Profile your application/game to determine the ideal thread count
• AMD "Bulldozer" is not an SMT design
• Avoid core clamping
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not general
    // guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }

AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors
  • Number of cores per processor & Number of threads per processor return incorrect values
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2

AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks
  • Otherwise the application may crash or exhibit unexpected behavior
• By default, an application is constrained to a single group, a static set of up to 64 logical processors
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                       // logical processors
    DWORD_PTR p = 0xffffffffffffffff;     // process mask
    int b = 32;                           // bit index

    #if 0 // BAD
        int x = p;                        // signed narrowing
        int count = 0;
        // while (x != 0) {               // never zero
        while (x > 0) {                   // false
            count += (x & 1);
            x >>= 1;                      // fill bits with sign
        }
        printf("popcnt = %i\n", count);   // 0
        printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
        DWORD_PTR x = p;
        int count = 0;
        while (x != 0) {
            count += (x & 1);
            x >>= 1;
        }
        printf("popcnt = %i\n", count);
        printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
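
The same point can be shown in portable C++ without the Windows types. This is a minimal sketch under stated assumptions: `countSetBits` and `bitMask` are hypothetical helper names, and `std::uint64_t` stands in for a 64-bit `DWORD_PTR` affinity mask.

```cpp
#include <cassert>
#include <cstdint>

// Counts the set bits in a 64-bit affinity mask. The unsigned type makes
// right shifts zero-fill, so the loop always reaches zero and terminates.
int countSetBits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Builds a single-bit mask. The 1ULL literal keeps the shift at least
// 64 bits wide, so bit indices 32..63 are well defined.
std::uint64_t bitMask(unsigned bitIndex) {
    assert(bitIndex < 64);
    return 1ULL << bitIndex;
}
```

With a full 64-processor mask, `countSetBits` reports 64, whereas the signed, narrowed variant on the slide reports 0.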

GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

    // char buffer[0x1000];  // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // use the returned topology records here
            }
            free(buffer);
        }
    }

PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer
  • The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation
  • See the Compatibility Administrator
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub

BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: D3D12Multithreading, TitleThrottle=10000. Y axis: cpu time difference from NumContexts 1 (-50 to 400); X axis: NumContexts (0 to 16); series: Draws 1025 and Draws 10250 (higher is better)]
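
The recommendation on this slide, read here as NumContexts = min(cores - 1, Draws / 250), can be sketched as a small helper. This is an illustration only: the function name `recommendedNumContexts` is hypothetical, integer division is assumed, and the lower clamp to one context is an added assumption so that small draw counts still get a worker.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper for NumContexts = min(cores - 1, draws / 250).
// Assumptions: integer division, and a floor of 1 context.
int recommendedNumContexts(int physicalCores, int draws) {
    const int byCores = physicalCores - 1;   // leave one core for the main thread
    const int byDraws = draws / 250;         // roughly 250 draws per context
    return std::max(1, std::min(byCores, byDraws));
}
```

For the two draw counts in the chart on an 8-core processor, this yields 4 contexts for 1025 draws and 7 contexts for 10250 draws. Treat the result as a starting point and profile, since the ideal value depends on per-draw cost.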

USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0

USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {  // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
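
For code that cannot use the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is not the talk's code: the `SpinLock` type and the `SPIN_PAUSE` macro are hypothetical names, the memory orders are one reasonable choice rather than the only one, and on non-x86 targets the pause hint is assumed to be a no-op.

```cpp
#include <atomic>
#include <cassert>
#if defined(_M_X64) || defined(_M_IX86) || defined(__x86_64__) || defined(__i386__)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() ((void)0)  // assumption: no pause hint on this target
#endif

// One lock per 64-byte cache line, as the slide recommends.
struct alignas(64) SpinLock {
    std::atomic<bool> taken{false};

    void lock() {
        // Test with a plain load first; attempt the exchange (test-and-set)
        // only when the lock looks free, and pause between attempts.
        while (taken.load(std::memory_order_relaxed) ||
               taken.exchange(true, std::memory_order_acquire)) {
            SPIN_PAUSE();
        }
    }

    // Single attempt: succeeds only if the lock was free.
    bool try_lock() {
        return !taken.load(std::memory_order_relaxed) &&
               !taken.exchange(true, std::memory_order_acquire);
    }

    void unlock() { taken.store(false, std::memory_order_release); }
};
```

Because the type provides `lock`, `try_lock`, and `unlock`, it can be used with `std::lock_guard<SpinLock>`. Unlike the slide's `Unlock`, this sketch releases with a store rather than an exchange; which instruction the compiler emits for each operation may vary.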

PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Use Best Practices With Spinlocks, Elapsed Time (s) by binary (less is better)]
            Before    After
Time (s)    304       27 (+1018%)

CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }

PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET

BEFORE (BAD)
    00000001400010C1:
        xor eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp eax,ecx
        je 00000001400010C1

AFTER (GOOD)
    00000001400010C0:
        cmp dword ptr [?gLock@@3IA],1
        je 00000001400010D9
        mov eax,1
        xchg eax,dword ptr [?gLock@@3IA]
        cmp eax,1
        jne 00000001400010DD
    00000001400010D9:
        pause
        jmp 00000001400010C0
    00000001400010DD:

INTERLOCKED INTRINSIC FUNCTIONS

intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg

Preferred: the xchg rows, instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.

AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround

WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456
    XmmCopySmall:
        bt __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc memcpy_repmovs
        movups xmm0, [rcx + rdx]     ; load deferred bytes
        add rcx, 16
        sub r8, 16

memset.asm line 93
        ; Check if strings should be used
        cmp r8, 128                  ; is this a small set, size <= 128?
        ja XmmSet                    ; if large set, use XMM set
        bt __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc XmmSetSmall              ; otherwise, use a 16-byte block set
        jmp memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.

PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: call memcpy benchmark. Y axis: Elapsed Time (s) (0 to 35); X axis: size (0 to 160); series: before and after (less is better)]

CODE SAMPLE

    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, rand, srand
    #include <cstring>  // memcpy, memset
    #include <numeric>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // i < sizeof(a): the slide's i <= sizeof(a) writes one byte past the array
        for (int i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }

DISASSEMBLY SNIPPET

BEFORE (WITHOUT WORKAROUND)
    ; calls vcruntime140.dll!memcpy
    memcpy:
        jmp qword ptr [__imp_memcpy]

AFTER (WITH WORKAROUND)
    ; calls memcpy within executable with workaround
    memcpy:
        mov r11,rcx
        mov r10,rdx
        cmp r8,10h
        jbe 00000001400012F0
        cmp r8,20h
        jbe 00000001400012D0
        sub rdx,rcx
        jae 00000001400012A4
        lea rax,[r8+r10]
        cmp rcx,rax
        jb 00000001400015D0
        cmp r8,80h
        jbe 0000000140001510
        …

AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions

PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid Too Many Non-Temporal Streams, Elapsed Time (s) by binary (less is better)]
            Before    After
Time (s)    104       47 (+123%)

CODE SAMPLE

    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds) %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }

PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET

BEFORE (WITHOUT WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0

AFTER (WITH WORKAROUND)
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps xmm0,xmm1
    addps xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups xmmword ptr [rsi+r8*4+4640h],xmm0

AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters – especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
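
The "align and pad" advice can be checked at compile time and at run time. The following is a minimal sketch of the layout half of the technique only: `PackedCounter`, `PaddedCounter`, `sameCacheLine`, and the 64-byte `kCacheLine` constant are hypothetical names and assumptions for illustration, not part of the talk's sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Packed: adjacent array elements share a cache line, so threads that
// each write their own element still contend for the same line.
struct PackedCounter { unsigned long sum; };

// Padded: alignas rounds sizeof up to 64 bytes, so each array element
// starts on its own cache line.
struct alignas(kCacheLine) PaddedCounter { unsigned long sum; };

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each per-thread counter should occupy exactly one cache line");

// Returns true when two objects start in the same 64-byte line.
bool sameCacheLine(const void* a, const void* b) {
    const auto lineA = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    const auto lineB = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return lineA == lineB;
}
```

Passing `&counters[i]` of a `PaddedCounter` array to thread `i` gives every thread a private line, at the cost of 64 bytes per counter instead of 4 or 8.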

PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid False Sharing, Time (s) by binary]
            Before    After
Time (s)    365       47 (+679%)

CODE SAMPLE

    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms) %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }

PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks; Ret inst; IPC; Data Cache Accesses PTI; Demand Data Cache Miss; Data Cache Miss; Data Cache Refills DRAM PTI; Data Cache Refills CCX PTI; Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX
[Profile columns: CPU clocks; Ret inst; IPC; Data Cache Accesses PTI; Demand Data Cache Miss; Data Cache Miss; Data Cache Refills DRAM PTI; Data Cache Refills CCX PTI; Data Cache Refills L2 PTI]

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET

BEFORE (BAD)
    0000000140001180:
        mov qword ptr [rsp+28h],rsi
        lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov r9,rbp
        mov dword ptr [rsp+20h],esi
        xor edx,edx
        xor ecx,ecx
        call qword ptr [__imp_CreateThread]
        mov qword ptr [r12+rdi*8],rax
        add rbp,4
        inc rdi
        cmp rdi,r14
        jb 0000000140001180

AFTER (GOOD)
    0000000140001180:
        mov qword ptr [rsp+28h],rsi
        lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov r9,rbp
        mov dword ptr [rsp+20h],esi
        xor edx,edx
        xor ecx,ecx
        call qword ptr [__imp_CreateThread]
        mov qword ptr [r15+rdi*8],rax
        add rbp,40h
        inc rdi
        cmp rdi,r14
        jb 0000000140001180

ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon

QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken

DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201934
bull Binary compiled after applying platformx64 shows higher performancebull ConfigurationReleaseSSE2 uses SSE2 intrinsics for
some functionsbull See httplamesourceforgenet
bull Performance of binary compiled with Microsoft Visual Studio 2017 v1596
bull Testing done by Kenneth Mitchell February 9 2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
bull WAV file 448MB 444 minutes
COMPILE FOR PLATFORMX64
Release x86
ReleaseSSE2
x86
Release x64
ReleaseSSE2
x64seconds 52 44 43 39
+18 +21 +33
0102030405060
Elap
sed
Tim
e (s
)
Configuration Platform
LAME 3100WAV to MP3 Encode
(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
bull AMD ldquoPinnacle Ridgerdquo processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used
bull Do not use FMA4 instructions if they are not enumerated by CPUID
bull More commonly AMD ldquoGreyhoundrdquo amp ldquoLlanordquo processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used
bull ISPC httpsgithubcomispcispcbull x86-x64 issue fixed November 21 2018bull CPU_Pentium4 issue fixed July 12 2018
bull Masked Occlusion Culling httpsgithubcomGameTechDevMaskedOcclusionCulling
bull pabsd issue fixed November 2nd 2018bull Bullet3 httpsgithubcombulletphysicsbullet3
bull _xgetbv issue fixed June 20th 2014
bull Windowsreg 10 x64 requires SSE2 amp PrefetchWbull See httpsdocsmicrosoftcomen-uswindows-
hardwaredesignminimumminimum-hardware-requirements-overview
bull The AMD64 Instruction Set Architecture includes SSE2 amp PrefetchW
bull Windows 7 x86 does NOT require SSE2 amp PrefetchW
35
0000gt g(79c838) Illegal instruction - code c000001d (first chance)(79c838) Illegal instruction - code c000001d ( second chance )modfoobar+0xb6300007ff7`25a78663 660f381ed4 pabsd xmm2xmm4
TEST CPUID BEFORE CALLING INSTRUCTIONS
USE BEST PRACTICES WHEN COUNTING CORES
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
This advice is specific to AMD processors and is not general guidance for all processor vendorsDWORD getDefaultThreadCount()
DWORD cores logicalgetProcessorCount(cores logical)DWORD count = logicalchar vendor[13]getCpuidVendor(vendor)if (0 == strcmp(vendor AuthenticAMD))
if (0x15 == getCpuidFamily()) AMD Bulldozer family microarchitecturecount = logical
else count = cores
return count
USE ALL PHYSICAL CORESbull This advice is specific to AMD
processors and is not general guidance for all processor vendors
bull Generally applications show SMT benefits and use of all logical processors is recommendedbull But games often suffer from SMT
contention on the main threadbull One strategy to reduce this contention is
to create threads based on physical core count rather than logical processor count
bull Profile your applicationgame to determine the ideal thread count
bull AMD ldquoBulldozerrdquo is not a SMT designbull Avoid core clampingbull See httpsgpuopencomcpu-core-
count-detection-windows
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201938
Incorrect enumexe output for Ryzentrade 7 2700X
Physical Processor ID 0 has 16 coresas logical processors 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors 16Number of active physical processors 1Number of cores per processor 16Number of threads per processor core 1
--------------------------------------------------------------------------------Correct output
Number of cores per processor 8Number of threads per processor core 2
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLEbull Deprecated enumc code sample from
June 30 2009 does not work properly on AMD family 17h processorsbull Number of cores per processor amp
Number of threads per processor return incorrect values
bull Formerly hosted at httpdeveloperamdcomresourcesdocumentation-articlesarticles-whitepapersprocessor-and-core-enumeration-using-cpuid
bull See httpsgithubcomGPUOpen-LibrariesAndSDKscpu-core-counts
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing of affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions.
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                       // logical processors
    DWORD_PTR p = 0xffffffffffffffff;     // process mask
    int b = 32;                           // bit index

    #if 0 // BAD
    int x = p;                            // signed narrowing
    int count = 0;
    // while (x != 0) {                   // never zero
    while (x > 0) {                       // false
        count += (x & 1);
        x >>= 1;                          // fill bits with sign
    }
    printf("popcnt = %i\n", count);       // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined; 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
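The two pitfalls above can be reproduced without any Windows headers. The following is a minimal, portable sketch, assuming `std::uint64_t` as a stand-in for a 64-bit `DWORD_PTR` affinity mask. The names `AffinityMask`, `count_logical_processors` and `processor_bit` are hypothetical helpers for illustration and are not part of any Windows API.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a 64-bit DWORD_PTR affinity mask (assumption: 64-bit build).
using AffinityMask = std::uint64_t;

// Count set bits using an unsigned type, so the right shift fills the
// vacated positions with zeros and the loop ends after at most 64 steps.
inline int count_logical_processors(AffinityMask mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Build the mask for one logical processor index (0..63). The shift is done
// on a 64-bit unsigned value, so indices 31 and above stay well defined.
inline AffinityMask processor_bit(unsigned index) {
    assert(index < 64);
    return AffinityMask{1} << index;
}
```

Keeping the mask in an unsigned 64-bit type from the point where it is read until the last shift is the design choice that matters here; any intermediate `int` reintroduces both problems.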
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"

    // char buffer[0x1000];               // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor counts long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or insufficient buffers.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• A binary compiled using NumContexts WorkerThreads shows higher performance given a sufficient Draw count.
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X-axis: NumContexts, 0 to 16. Y-axis: CPU time difference from NumContexts 1, -50 to 400. Series: Draws 1025 and Draws 10250.]
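The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a sketch only: the function name `recommended_num_contexts` is hypothetical, integer division is assumed for Draws/250, and the clamp to at least one context is an assumption added here that the slide does not state.

```cpp
#include <algorithm>
#include <cassert>

// Sketch of NumContexts = min(cores - 1, Draws / 250).
// Assumption: the result is clamped to at least 1 so that small draw
// counts or single-core systems still get one recording context.
inline unsigned recommended_num_contexts(unsigned cores, unsigned draws) {
    assert(cores >= 1);
    const unsigned by_cores = cores - 1;   // "cores - 1" term of the recommendation
    const unsigned by_draws = draws / 250; // "Draws / 250" term, integer division
    return std::max(1u, std::min(by_cores, by_draws));
}
```

For the two draw counts charted above on an 8-core processor, this yields 4 contexts for 1025 draws and 7 contexts for 10250 draws. The right value for a given title should still be confirmed by profiling.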
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                            Recommended Value
r.rhicmdusedeferredcontexts        1
r.rhicmduseparallelalgorithms      1
r.rhithread.enable                 1
r.rhicmdbypass                     0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin lock best practices
  • Avoid lock prefix instructions
  • Use the pause instruction
  • Test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
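The same test and test-and-set pattern can be sketched with standard C++ atomics instead of MSVC intrinsics. This is an illustrative variant, not the code from the slide: the class `SpinLock`, the macro `SPIN_PAUSE` and the demo function `count_with_lock` are hypothetical names, the unlock uses a release store rather than `_InterlockedExchange`, and the `std::this_thread::yield()` fallback for non-x86 targets is an assumption. On x86 compilers, `std::atomic::exchange` typically compiles to `xchg`, but the instructions generated may vary by compiler and flags.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield() // fallback assumption for non-x86 targets
#endif

// Test and test-and-set spinlock, aligned so the lock sits in its own cache line.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load while the lock is taken.
            while (taken_.load(std::memory_order_relaxed)) {
                SPIN_PAUSE();
            }
            // Test-and-set: one exchange attempts to take the lock.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    bool try_lock() {
        return !taken_.load(std::memory_order_relaxed) &&
               !taken_.exchange(true, std::memory_order_acquire);
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> taken_{false};
};

// Hypothetical demo: several threads increment a shared counter under the lock.
inline long count_with_lock(unsigned num_threads, long increments_per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            for (long i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

Calling `count_with_lock(4, 10000)` should return 40000 if the lock provides mutual exclusion. Because the class exposes `lock`, `try_lock` and `unlock`, it can also be used with `std::lock_guard`.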
PERFORMANCE
• The binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before    After
Time (s)    304       27

[Chart: Use Best Practices With Spinlocks (less is better). Y-axis: Elapsed Time (s), 0 to 350. X-axis: binary. Annotation: +1018%.]
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>     // std::accumulate
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

[Figures: SOURCE BEFORE, SOURCE AFTER]

DISASSEMBLY SNIPPET

AFTER (GOOD)
00000001400010C0:
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
00000001400010D9:
    pause
    jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg

Preferred: instructions without the lock prefix (_InterlockedExchange, _InterlockedExchangePointer).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND

Workaround
• Modify assembly files (copying recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble object files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder.

; memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]         ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

; memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                   ; is this a small set, size <= 128?
    ja      XmmSet                    ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• The binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X-axis: size, 0 to 160. Y-axis: Elapsed Time (s), 0 to 35. Series: before and after.]
CODE SAMPLE

    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Use < rather than <= so the loop does not write one byte past the end of a.
        for (int i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
; calls memcpy within the executable, with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• The binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before    After
Time (s)    104       47

[Chart: Avoid Too Many Non-Temporal Streams (less is better). Y-axis: Elapsed Time (s), 0 to 120. X-axis: binary. Annotation: +123%.]
CODE SAMPLE

    #include <intrin.h>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atof, atoll, rand, srand
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

[Figures: SOURCE BEFORE, SOURCE AFTER]

DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups   xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING

SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
    • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™
• Use thread-local rather than global or process-shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
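The "align and pad" advice above can be checked at compile time. The following is a minimal sketch, assuming 64-byte cache lines. The names `kCacheLineSize`, `PackedCounter`, `PaddedCounter` and `element_stride` are hypothetical and chosen for illustration.

```cpp
#include <cstddef>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLineSize = 64;

// Unpadded: adjacent array elements share a cache line, so threads that
// each update "their" element still contend for the same line.
struct PackedCounter {
    unsigned long sum;
};

// Padded: alignas rounds sizeof up to a multiple of the alignment, so
// each array element occupies its own cache line.
struct alignas(kCacheLineSize) PaddedCounter {
    unsigned long sum;
};

// Byte distance between element 0 and element 1 of an array of T.
template <typename T>
std::size_t element_stride() {
    T items[2] = {};
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&items[1]) -
        reinterpret_cast<const char*>(&items[0]));
}

static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per cache line");
static_assert(alignof(PaddedCounter) == kCacheLineSize, "counter starts on a line boundary");
```

Here `element_stride<PaddedCounter>()` is 64, while `element_stride<PackedCounter>()` equals `sizeof(unsigned long)`. Note that heap allocation of over-aligned types with `new[]` is only guaranteed to honor the requested alignment from C++17 onward.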
PERFORMANCE
• The binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before    After
Time (s)    365       47

[Chart: Avoid False Sharing. Y-axis: Time (s), 0 to 400. X-axis: binary. Annotation: +679%.]
CODE SAMPLE

    #include <windows.h>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, rand, srand
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX

PROFILING AFTER
Few refills from CCX

[Profiler table columns in both screenshots: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

[Figures: SOURCE BEFORE, SOURCE AFTER]

DISASSEMBLY SNIPPET

AFTER (GOOD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180

BEFORE (BAD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 ready
• AM4 desktop infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
kenmitchellken
DISCLAIMER AND ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
TEST CPUID BEFORE CALLING INSTRUCTIONS
• AMD "Pinnacle Ridge" processors will throw an ILLEGAL_INSTRUCTION exception if AVX512 is used.
• Do not use FMA4 instructions if they are not enumerated by CPUID.
• More commonly, AMD "Greyhound" & "Llano" processors will throw an ILLEGAL_INSTRUCTION exception if SSSE3 is used.
  • ISPC: https://github.com/ispc/ispc
    • x86-x64 issue fixed November 21, 2018
    • CPU_Pentium4 issue fixed July 12, 2018
  • Masked Occlusion Culling: https://github.com/GameTechDev/MaskedOcclusionCulling
    • pabsd issue fixed November 2nd, 2018
  • Bullet3: https://github.com/bulletphysics/bullet3
    • _xgetbv issue fixed June 20th, 2014
• Windows® 10 x64 requires SSE2 & PrefetchW.
  • See https://docs.microsoft.com/en-us/windows-hardware/design/minimum/minimum-hardware-requirements-overview
  • The AMD64 Instruction Set Architecture includes SSE2 & PrefetchW.
• Windows 7 x86 does NOT require SSE2 & PrefetchW.

    0:000> g
    (79c838): Illegal instruction - code c000001d (first chance)
    (79c838): Illegal instruction - code c000001d (!!! second chance !!!)
    mod!foobar+0xb63:
    00007ff7`25a78663 660f381ed4      pabsd   xmm2,xmm4
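As a small illustration of testing CPUID before calling instructions, the sketch below decodes two of the feature bits named above from a register value that the caller has already read. It assumes the caller executed CPUID beforehand (for example with `__cpuid` on MSVC or `__get_cpuid` on GCC/Clang). The helper names `has_ssse3` and `has_fma4` are hypothetical. A complete check for AVX-family features also needs operating system state checks, which this sketch does not cover.

```cpp
#include <cstdint>

// Feature flags as reported in ECX of the named CPUID leaf.
constexpr std::uint32_t kSsse3Bit = 1u << 9;  // CPUID leaf 0x00000001, ECX bit 9
constexpr std::uint32_t kFma4Bit = 1u << 16;  // CPUID leaf 0x80000001, ECX bit 16

// True if the ECX value from CPUID leaf 0x00000001 reports SSSE3.
inline bool has_ssse3(std::uint32_t leaf1_ecx) {
    return (leaf1_ecx & kSsse3Bit) != 0;
}

// True if the ECX value from CPUID leaf 0x80000001 reports FMA4.
inline bool has_fma4(std::uint32_t ext_leaf1_ecx) {
    return (ext_leaf1_ecx & kFma4Bit) != 0;
}
```

A caller would pick the SSSE3 code path (such as one using `pabsd`) only when `has_ssse3` returns true, and fall back to an SSE2 path otherwise.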
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits, and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD "Bulldozer" is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
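The heuristic above can be expressed as a pure function, which makes it easy to unit test without CPUID or operating system queries. This is a sketch under stated assumptions: the name `default_thread_count` and its parameters are hypothetical, and the caller is assumed to have obtained the vendor string, family, and core counts elsewhere.

```cpp
#include <cstring>

// Pure-function variant of the thread count heuristic: all inputs are
// passed in, so no CPUID or OS query is performed here.
inline unsigned default_thread_count(const char* vendor, unsigned family,
                                     unsigned physical_cores,
                                     unsigned logical_processors) {
    const bool is_amd = std::strcmp(vendor, "AuthenticAMD") == 0;
    const bool is_family_15h = (family == 0x15); // AMD "Bulldozer" family
    if (is_amd && !is_family_15h) {
        return physical_cores;      // AMD SMT designs: one thread per physical core
    }
    return logical_processors;      // other vendors, or AMD family 15h
}
```

For a Ryzen 7 2700X (family 17h, 8 cores, 16 logical processors) this returns 8. As the slide notes, the result is a starting point, and profiling should determine the final thread count.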
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201938
Incorrect enumexe output for Ryzentrade 7 2700X
Physical Processor ID 0 has 16 coresas logical processors 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
Number of active logical processors 16Number of active physical processors 1Number of cores per processor 16Number of threads per processor core 1
--------------------------------------------------------------------------------Correct output
Number of cores per processor 8Number of threads per processor core 2
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLEbull Deprecated enumc code sample from
June 30 2009 does not work properly on AMD family 17h processorsbull Number of cores per processor amp
Number of threads per processor return incorrect values
bull Formerly hosted at httpdeveloperamdcomresourcesdocumentation-articlesarticles-whitepapersprocessor-and-core-enumeration-using-cpuid
bull See httpsgithubcomGPUOpen-LibrariesAndSDKscpu-core-counts
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DWORD lps = 64 logical processorsDWORD_PTR p = 0xffffffffffffffff process maskint b = 32 bit index
if 0 BAD int x = p signed narrowingint count = 0while (x = 0) never zerowhile (x gt 0) false
count += (x amp 1)x gtgt= 1 fill bits with sign
printf(popcnt = in count) 0printf(mask b[i] = 0x016llxn b (1 ltlt b)) undefined 0
else GOOD DWORD_PTR x = pint count = 0while (x = 0)
count += (x amp 1)x gtgt= 1
printf(popcnt = in count)printf(mask b[i] = 0x016llxn b (1ULL ltlt b)) 0x0000000100000000endif
AVOID SIGNED NARROWING AFFINITY MASKS
bull Avoid signed narrowing affinity masksbull Otherwise the application may crash
or exhibit unexpected behaviorbull By default an application is
constrained to a single group a static set of up to 64 logical processors
bull Right Shifts For signed numbers the sign bit is used to fill the vacated bit positions
bull Left Shifts If you left-shift a signed number so that the sign bit is affected the result is undefined
bull See httpsmsdnmicrosoftcomen-uslibrary336xbhczaspx
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
char buffer[0x1000] bad assumption char buffer = NULLDWORD len = 0if (FALSE == GetLogicalProcessorInformationEx(
RelationAll (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer amplen)) if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)
printf(len = 0xxn len)buffer = (char)malloc(len)if (GetLogicalProcessorInformationEx(
RelationAll(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)bufferamplen))
free(buffer)
GET GETLOGICALPROCESSORINFORMATIONEXBUFFER LENGTH AT RUNTIMEbull Get
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
bull Otherwise the application may crash if an insufficiently sized buffer was created at compile time
bull See httpsgithubcomGPUOpen-LibrariesAndSDKscpu-core-countsblobmasterwindowsThreadCount-Win7cpp
bull WinDbg Commandsbull bp kernelbaseGetLogicalProcessorInformation
printf FOUND GetLogicalProcessorInformationBuffer Length 0xxnpoi(rdx)ldquo
bull bp kernelbaseGetLogicalProcessorInformationExprintf FOUND GetLogicalProcessorInformationEx Buffer Length 0xxnpoi(r8)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201941
bull It may not be practical for developers to fix application code bugs related to high processor count long after release bull Costs may include restoring source from an archive buying more servers hiring more engineers
coding testing and releasingbull Although application updates are strongly preferred AMD and the Microsoft Windows Compatibility team
collaborated on a last-resort Application Compatibility Shimbull The ProcessorCountLie shim passes limited CPU topology to the application preventing processor count
bugs in application code such as signed narrowing affinity or insufficient buffer bull The updated shim in Windowsreg 10 version 19H1 hooks processor related APIs such as GetSystemInfo
and GetLogicalProcessorInformationbull See the Compatibility Administrator
bull Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMANDS LISTS IN PARALLEL
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201943
bull Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw countbull See httpsgithubcomMicrosoftDirectX-Graphics-
SamplestreemasterSamplesDesktopD3D12Multithreadingbull Recommend NumContexts = min(cores-1 Draws250) bull Performance of binary compiled with Microsoftreg Visual Studio
2017 v1596bull Testing done by Kenneth Mitchell February 12 2019 on the
following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 (driver 1921 Feb4) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
-500
50100150200250300350400
0 4 8 12 16
cpu
time
di
ffere
nce
from
N
umC
onte
xts
1
NumContexts
D3D12MultithreadingTitleThrottle=10000
(higher is better)Draws1025 Draws10250
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201944
bull Be sure to re-enable parallel rendering after using debug features before you shipbull See httpsdocsunrealenginecomen-USProgrammingRenderingParallelRendering
USE UE4 PARALLEL RENDERING
Command Recommended Valuerrhicmdusedeferredcontexts 1rrhicmduseparallelalgorithms 1rrhithreadenable 1rrhicmdbypass 0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            // Every spin iteration issues a lock cmpxchg.
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test first with a plain read, then test-and-set with xchg.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif

        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
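For code that does not use the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant under stated assumptions, not the presenter's code: the `SpinLock` type, the `SPIN_PAUSE` macro, and the `countWithSpinLock` demo are hypothetical names, and unlike the slide's `Unlock`, this version releases the lock with a plain release store rather than an exchange.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
// Assumption: on non-x86 targets, fall back to yielding the time slice.
#define SPIN_PAUSE() std::this_thread::yield()
#endif

// Hypothetical lock type, aligned so it owns a whole 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load and pause while the lock is taken.
            while (state.load(std::memory_order_relaxed) == 1) SPIN_PAUSE();
            // Test-and-set: a single atomic exchange once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0) return;
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};

// Demo: every thread increments a shared counter under the lock.
inline long countWithSpinLock(unsigned threads, long perThread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (long i = 0; i < perThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Using `std::atomic` also guarantees that the spinning load is re-read on every iteration, which a plain non-volatile pointer read does not formally promise.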
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before   After
Time (s)    304      27

+1018%

[Chart: Use Best Practices With Spinlocks (less is better). x-axis: binary; y-axis: Elapsed Time (s), 0 to 350.]
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>   // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();

        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);

        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());

        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE
[Figure: source view before the change]

SOURCE AFTER
[Figure: source view after the change]

DISASSEMBLY SNIPPET

AFTER (GOOD)
    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:

BEFORE (BAD)
    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic                              generates instructions including
_InterlockedAnd                        lock cmpxchg
_interlockedbittestandreset            lock btr
_interlockedbittestandset              lock bts
_InterlockedCompareExchange            lock cmpxchg
_InterlockedCompareExchange128         lock cmpxchg16b
_InterlockedCompareExchangePointer     lock cmpxchg
_InterlockedDecrement                  lock dec
_InterlockedExchangeAdd                lock add
_InterlockedIncrement                  lock inc
_InterlockedOr                         lock cmpxchg
_InterlockedXor                        lock cmpxchg
_InterlockedExchange                   xchg
_InterlockedExchangePointer            xchg

Preferred: instructions without lock prefix (_InterlockedExchange, _InterlockedExchangePointer).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder.

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). x-axis: size, 0 to 160; y-axis: Elapsed Time (s), 0 to 35; series: before, after.]
CODE SAMPLE

    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, rand, srand
    #include <cstring>   // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        // size is only known at run time, so the call cannot be inlined
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        for (size_t i = 0; i < sizeof(a); ++i)   // stay within the bounds of a
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));

        work(size, steps);

        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
    ; calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …

BEFORE (WITHOUT WORKAROUND)
    ; calls memcpy in vcruntime140.dll
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
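One way to follow the "one stream per hardware thread" advice is to split an interleaved loop into separate passes, each with a single non-temporal destination. The sketch below illustrates that idea and is a different approach from the deck's own code sample, which instead replaces one of the two streaming stores with a regular store. The function names `integrateOneStream` and `stepSplit` are hypothetical, the data layout (4 position floats followed by 4 velocity floats) is assumed, and whether splitting is faster in a given workload should be confirmed by profiling.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence

// Hypothetical single-stream pass over one array.
// Assumed layout per 8 floats: 4 position floats, then 4 velocity floats.
// Assumes `data` is 16-byte aligned and `len` is a multiple of 8.
inline void integrateOneStream(float* data, std::size_t len, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < len; i += 8) {
        __m128 p = _mm_load_ps(&data[i]);
        __m128 v = _mm_load_ps(&data[i + 4]);
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));
        _mm_stream_ps(&data[i], p);  // the only non-temporal stream in this loop
    }
    _mm_sfence();  // order the streamed stores before later stores
}

// Processes the two arrays one after the other instead of interleaving them.
inline void stepSplit(float* a, float* b, std::size_t len, float dt) {
    integrateOneStream(a, len, dt);
    integrateOneStream(b, len, dt);
}
```

The trade-off is a second pass over the loop overhead in exchange for keeping a single write-combining destination active at a time.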
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before   After
Time (s)    104      47

+123%

[Chart: Avoid Too Many Non-Temporal Streams (less is better). x-axis: binary; y-axis: Elapsed Time (s), 0 to 120.]
CODE SAMPLE

    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, atoll, atof, rand, srand
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x, y, z, w, vx, vy, vz, vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE
[Figure: source view before the change]

SOURCE AFTER
[Figure: source view after the change]

DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
        mulps    xmm0,xmm1
        addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
        movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
        movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
        mulps    xmm0,xmm1
        addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
        movups   xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
        mulps    xmm0,xmm1
        addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
        movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
        movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
        mulps    xmm0,xmm1
        addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
        movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad Thread parameters – especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
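The "align and pad" advice can be sketched portably with `std::thread`. This is a minimal illustration, not the deck's benchmark: the names `PaddedCounter` and `countPerThread` are hypothetical, and the 64-byte figure is assumed to match the cache line size of the target processor.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical per-thread slot, aligned and padded to one 64-byte cache line.
struct alignas(64) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

// Each thread writes only to its own slot, so no two threads share a line.
inline std::vector<unsigned long> countPerThread(unsigned numThreads,
                                                 unsigned long iterations) {
    // Note: heap allocation honors alignas(64) from C++17 onward.
    std::vector<PaddedCounter> slots(numThreads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < numThreads; ++t)
        pool.emplace_back([&slots, t, iterations] {
            for (unsigned long i = 0; i < iterations; ++i)
                slots[t].sum += (i & 1);  // adds 1 on every odd iteration
        });
    for (auto& th : pool) th.join();

    std::vector<unsigned long> out;
    for (const auto& s : slots) out.push_back(s.sum);
    return out;
}
```

Removing `alignas(64)` from the struct packs several counters into one line, which is the layout the summary warns against.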
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before   After
Time (s)    365      47

+679%

[Chart: Avoid False Sharing. x-axis: binary; y-axis: Time (s), 0 to 400.]
CODE SAMPLE

    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <cstdio>    // printf
    #include <cstdlib>   // atoi, rand, srand
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX.
[Table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX.
[Table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

SOURCE BEFORE
[Figure: source view before the change]

SOURCE AFTER
[Figure: source view after the change]

DISASSEMBLY SNIPPET

AFTER (GOOD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180

BEFORE (BAD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon

QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
USE BEST PRACTICES WHEN COUNTING CORES

USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
• AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is
    // not general guidance for all processor vendors
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
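The snippet above calls `getCpuidVendor` and `getCpuidFamily` without showing them. One possible shape for those helpers is sketched below, assuming an x86 target; this is not the GPUOpen implementation, and `cpuidLeaf` and `familyFromEax` are hypothetical names introduced here. The family calculation follows the usual CPUID convention: the base family from EAX bits 11:8, plus the extended family from bits 27:20 when the base family is 0xF.

```cpp
#include <cstdint>
#include <cstring>
#if defined(_MSC_VER)
#include <intrin.h>   // __cpuid
#else
#include <cpuid.h>    // __get_cpuid (GCC/Clang)
#endif

// Hypothetical wrapper: runs CPUID for one leaf and returns EAX, EBX, ECX, EDX.
inline void cpuidLeaf(unsigned leaf, std::uint32_t regs[4]) {
#if defined(_MSC_VER)
    int r[4];
    __cpuid(r, static_cast<int>(leaf));
    for (int i = 0; i < 4; ++i) regs[i] = static_cast<std::uint32_t>(r[i]);
#else
    unsigned a = 0, b = 0, c = 0, d = 0;
    __get_cpuid(leaf, &a, &b, &c, &d);  // leaves zeros if the leaf is unsupported
    regs[0] = a; regs[1] = b; regs[2] = c; regs[3] = d;
#endif
}

// The vendor string is stored in EBX, EDX, ECX of leaf 0, in that order.
inline void getCpuidVendor(char vendor[13]) {
    std::uint32_t r[4];
    cpuidLeaf(0, r);
    std::memcpy(vendor + 0, &r[1], 4);
    std::memcpy(vendor + 4, &r[3], 4);
    std::memcpy(vendor + 8, &r[2], 4);
    vendor[12] = '\0';
}

// Family = base family, plus extended family when the base family is 0xF.
inline unsigned familyFromEax(std::uint32_t eax) {
    const unsigned base = (eax >> 8) & 0xFu;
    const unsigned ext = (eax >> 20) & 0xFFu;
    return (base == 0xFu) ? base + ext : base;
}

inline unsigned getCpuidFamily() {
    std::uint32_t r[4];
    cpuidLeaf(1, r);
    return familyFromEax(r[0]);
}
```

Keeping the bit arithmetic in `familyFromEax` separate from the CPUID call makes the family decoding checkable with known signature values without needing the matching hardware.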
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• Deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • Number of cores per processor & Number of threads per processor return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
--------------------------------------------------------------------------------
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise, the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                       // logical processors
    DWORD_PTR p = 0xffffffffffffffff;     // process mask
    int b = 32;                           // bit index

    #if 0 // BAD
    int x = p;                            // signed narrowing
    int count = 0;
    // while (x != 0) {                   // never zero
    while (x > 0) {                       // false
        count += (x & 1);
        x >>= 1;                          // fill bits with sign
    }
    printf("popcnt = %i\n", count);       // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));     // undefined, 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));  // 0x0000000100000000
    #endif
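The same rules can be packaged as two small portable helpers that keep every operation in unsigned 64-bit arithmetic. This is an illustrative sketch; `countProcessors` and `processorBit` are hypothetical names, and `std::uint64_t` stands in for `DWORD_PTR` on 64-bit Windows.

```cpp
#include <cstdint>

// Counts set bits in a 64-bit affinity mask using unsigned arithmetic only.
inline int countProcessors(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;  // unsigned right shift: vacated bits are filled with zeros
    }
    return count;
}

// Builds the mask bit for one logical processor index in [0, 63].
inline std::uint64_t processorBit(unsigned index) {
    // The shift happens on a 64-bit unsigned value, so indices of 32 and
    // above are well defined, unlike (1 << index) on a 32-bit int.
    return std::uint64_t{1} << index;
}
```

Because the mask never passes through a signed or 32-bit type, the loop terminates and the upper 32 processors are counted.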
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise, the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length: 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length: 0x%x\\n\", poi(r8)"

    // char buffer[0x1000];   // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    // First call with a NULL buffer only asks for the required length.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // use the returned records here
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMANDS LISTS IN PARALLEL
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201943
bull Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw countbull See httpsgithubcomMicrosoftDirectX-Graphics-
SamplestreemasterSamplesDesktopD3D12Multithreadingbull Recommend NumContexts = min(cores-1 Draws250) bull Performance of binary compiled with Microsoftreg Visual Studio
2017 v1596bull Testing done by Kenneth Mitchell February 12 2019 on the
following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 (driver 1921 Feb4) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
-500
50100150200250300350400
0 4 8 12 16
cpu
time
di
ffere
nce
from
N
umC
onte
xts
1
NumContexts
D3D12MultithreadingTitleThrottle=10000
(higher is better)Draws1025 Draws10250
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201944
bull Be sure to re-enable parallel rendering after using debug features before you shipbull See httpsdocsunrealenginecomen-USProgrammingRenderingParallelRendering
USE UE4 PARALLEL RENDERING
Command Recommended Valuerrhicmdusedeferredcontexts 1rrhicmduseparallelalgorithms 1rrhithreadenable 1rrhicmdbypass 0
USE BEST PRACTICES WITH SPINLOCKS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Spin Lock Best Practices
bull avoid lock prefix instructionsbull use the pause instructionbull test and test-and-setbull alignas(64) lock variable
bull or _declspec(align(64))bull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull ALUTokenStall PTI
bull gt= 3K Per Thousand Instructions is bad for top functions
46
namespace MyLock typedef unsigned LOCK PLOCKenum LOCK_IS_FREE = 0 LOCK_IS_TAKEN = 1
if 0 BAD void Lock(PLOCK pl)
while (LOCK_IS_TAKEN == _InterlockedCompareExchange( pl LOCK_IS_TAKEN LOCK_IS_FREE)) lock xchg cmp
else GOOD
void Lock(PLOCK pl) while ((LOCK_IS_TAKEN == pl) ||
(LOCK_IS_TAKEN==_InterlockedExchange(pl LOCK_IS_TAKEN)))_mm_pause()
endifvoid Unlock(PLOCK pl)
_InterlockedExchange(pl LOCK_IS_FREE)
alignas(64) MyLockLOCK gLock
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 30
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
47
Before AfterTime (s) 304 27
+10180
50
100
150
200
250
300
350
Elap
sed
Tim
e (s
)
binary
Use Best Practices With Spinlocks(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude intrinhinclude stdiohinclude windowshinclude ltchronogtinclude ltnumericgtinclude ltthreadgtdefine LEN 512
alignas(64) float b[LEN][4][4]alignas(64) float c[LEN][4][4]
DWORD WINAPI ThreadProcCallback(LPVOID data) MyLockLock(ampgLock) alignas(64) float a[LEN][4][4]stdfill((float)a (float)(a + LEN) 00f)float r = 00for (size_t iter = 0 iter lt 100000 iter++)
for (int m = 0 m lt LEN m++)for (int i = 0 i lt 4 i++)
for (int j = 0 j lt 4 j++)for (int k = 0 k lt 4 k++)
a[m][i][j] += b[m][i][k] c[m][k][j]r += stdaccumulate((float)a
(float)(a + LEN) 00f) printf(result fn r) MyLockUnlock(ampgLock)return 0
48
int main(int argc char argv[]) using namespace stdchronofloat b0 = (argc gt 1) strtof(argv[1] NULL) 10ffloat c0 = (argc gt 2) strtof(argv[2] NULL) 20f stdfill((float)b (float)(b + LEN) b0)stdfill((float)c (float)(c + LEN) c0)int num_threads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[num_threads]high_resolution_clocktime_point t0 =
high_resolution_clocknow()
for (size_t i = 0 i lt num_threads ++i) threads[i] = CreateThread(NULL
0 ThreadProcCallback NULL 0 NULL) WaitForMultipleObjects(num_threads
threads TRUE INFINITE)
high_resolution_clocktime_point t1 = high_resolution_clocknow()
durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (milliseconds) lfn 10000 time_spancount())
delete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~4K stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~0 stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
51
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
52
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
53
AFTER (GOOD)
00000001400010C0
cmp dword ptr [gLock3IA]1
je 00000001400010D9
mov eax1
xchg eaxdword ptr [gLock3IA]
cmp eax1
jne 00000001400010DD
00000001400010D9
pause
jmp 00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
xor eaxeax
lock cmpxchg dword ptr [gLock3IA]ecx
cmp eaxecx
je 00000001400010C1
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
INTERLOCKED INTRINSIC FUNCTIONSintrinsic generates instructions including_InterlockedAnd lock cmpxchg_interlockedbittestandreset lock btr_interlockedbittestandset lock bts_InterlockedCompareExchange lock cmpxchg_InterlockedCompareExchange128 lock cmpxchg16b_InterlockedCompareExchangePointer lock cmpxchg_InterlockedDecrement lock dec_InterlockedExchangeAdd lock add_InterlockedIncrement lock inc_InterlockedOr lock cmpxchg_InterlockedXor lock cmpxchg_InterlockedExchange xchg_InterlockedExchangePointer xchg
54
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY amp MEMSETREGRESSION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201956
bull Currently ldquoCWindowsSystem32vcruntime140dllrdquo includes a memcpy amp memset regression that can affect AMD family 17h and later processors only
bull Code generated by MSVS 2015 2017 or 2019 may link to vcruntime140dllbull However we have found that this regression occurs under very specific
circumstances where length gt 32 ampamp length lt= 128 Bytes ampamp length is unknown at compile time
bull Meanwhile we propose a workaround
PREFACE
WORKAROUND
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

ALTERNATE WORKAROUND
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\” to the application folder

memcpy.asm line 456:

    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:

        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: "call memcpy benchmark" (less is better). X axis: size, 0 to 160. Y axis: Elapsed Time (s), 0 to 35. Series: before, after.]
CODE SAMPLE

    #include <numeric>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <cstring>

    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Time `steps` copies of `size` bytes; size is only known at run time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char *argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));

        work(size, steps);

        // Print one element of each array so the copies are not optimized away.
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND): calls memcpy within the executable

    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …

BEFORE (WITHOUT WORKAROUND): calls memcpy in vcruntime140.dll

    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: "Avoid Too Many Non-Temporal Streams" (less is better). Y axis: Elapsed Time (s), 0 to 120. X axis: binary.]

            Before   After
Time (s)    104      47      (+123%)
CODE SAMPLE

    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // layout per element: x, y, z, w, vx, vy, vz, vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);   // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char *argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE: ~23 WcbFull per 1000 instructions

PROFILING AFTER: ~3 WcbFull per 1000 instructions
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)

    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)

    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread-local rather than global or process-shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: "Avoid False Sharing". Y axis: Time (s), 0 to 400. X axis: binary.]

            Before   After
Time (s)    365      47      (+679%)
CODE SAMPLE

    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <cstdio>
    #include <cstdlib>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes: neighbouring elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes: one element per cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void *param) {
        ThreadData *p = (ThreadData *)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char *argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE *threads = new HANDLE[numThreads];
        ThreadData *a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        // Each thread receives a pointer to its own array element.
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void *)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE: many refills from CCX

PROFILING AFTER: few refills from CCX

[Profiler screenshots. Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI.]
DISASSEMBLY SNIPPET

AFTER (GOOD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180

BEFORE (BAD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New “Zen 2” core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU

KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
USE ALL PHYSICAL CORES
• This advice is specific to AMD processors and is not general guidance for all processor vendors.
• Generally, applications show SMT benefits, and use of all logical processors is recommended.
  • But games often suffer from SMT contention on the main thread.
  • One strategy to reduce this contention is to create threads based on physical core count rather than logical processor count.
• Profile your application/game to determine the ideal thread count.
  • AMD “Bulldozer” is not an SMT design.
• Avoid core clamping.
• See https://gpuopen.com/cpu-core-count-detection-windows

    // This advice is specific to AMD processors and is not
    // general guidance for all processor vendors.
    DWORD getDefaultThreadCount() {
        DWORD cores, logical;
        getProcessorCount(cores, logical);
        DWORD count = logical;
        char vendor[13];
        getCpuidVendor(vendor);
        if (0 == strcmp(vendor, "AuthenticAMD")) {
            if (0x15 == getCpuidFamily()) {
                // AMD "Bulldozer" family microarchitecture
                count = logical;
            } else {
                count = cores;
            }
        }
        return count;
    }
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • "Number of cores per processor" and "Number of threads per processor core" return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:

    Physical Processor ID 0 has 16 cores
    as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1

Correct output:

    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing of affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right shifts: for signed numbers, the sign bit is used to fill the vacated bit positions.
• Left shifts: if you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                       // logical processors
    DWORD_PTR p = 0xffffffffffffffff;     // process mask
    int b = 32;                           // bit index

    #if 0 // BAD
    int x = p;                            // signed narrowing
    int count = 0;
    // while (x != 0) {                   // never zero
    while (x > 0) {                       // false
        count += (x & 1);
        x >>= 1;                          // fill bits with sign
    }
    printf("popcnt = %i\n", count);                        // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));       // undefined; 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));    // 0x0000000100000000
    #endif
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length: 0x%x\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length: 0x%x\n\",poi(r8)"

    // char buffer[0x1000];   /* bad assumption */
    char *buffer = NULL;
    DWORD len = 0;
    // First call with a NULL buffer: fails and reports the required length.
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char *)malloc(len);
            // Second call with a buffer of the reported size.
            if (GetLogicalProcessorInformationEx(
                    RelationAll, (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
                // ...
            }
            free(buffer);
        }
    }
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
  • The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing of affinity masks or an insufficient buffer.
  • The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
  • See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1, Feb 4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: "D3D12Multithreading", TitleThrottle=10000 (higher is better). X axis: NumContexts, 0 to 16. Y axis: cpu time difference from NumContexts 1, -50 to 400. Series: Draws 1025, Draws 10250.]
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithreadenable                1
r.rhicmdbypass                   0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended), Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {  // lock cmpxchg
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            // Test first with a plain read; attempt the exchange only if free.
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif

        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: "Use Best Practices With Spinlocks" (less is better). Y axis: Elapsed Time (s), 0 to 350. X axis: binary.]

            Before   After
Time (s)    304      27      (+1018%)
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float *)a, (float *)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // a[m] += b[m] * c[m] for LEN 4x4 matrices
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float *)a, (float *)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char *argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float *)b, (float *)(b + LEN), b0);
        std::fill((float *)c, (float *)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE *threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();

        for (size_t i = 0; i < num_threads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);

        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());

        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE: ~4K stalls per 1000 instructions

PROFILING AFTER: ~0 stalls per 1000 instructions
DISASSEMBLY SNIPPET

AFTER (GOOD)

    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:

BEFORE (BAD)

    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
bull Minimize
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1593bull Testing done by Kenneth Mitchell February 5
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
72
Before AfterTime (s) 365 47
+6790
50
100
150
200
250
300
350
400
Tim
e (s
)
binary
Avoid False Sharing
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltwindowshgtinclude ltchronogtinclude ltnumericgtinclude ltthreadgtusing namespace stdchronodefine NUM_ITER 2000000000int seed
if 0 4 bytes struct ThreadData unsigned long sum else 64 bytes struct alignas(64) ThreadData unsigned long sum endifDWORD WINAPI ThreadProcCallback(void param) ThreadData p = (ThreadData)paramsrand(seed)p-gtsum = 0for (int i = 0 i lt NUM_ITER i++)
p-gtsum += rand() 2return 0
73
int main(int argc char argv[]) seed = (argc gt 1) atoi(argv[1]) 3int numThreads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[numThreads]ThreadData a = new ThreadData[numThreads]high_resolution_clocktime_point t0 = high_resolution_clocknow()for (size_t i = 0 i lt numThreads ++i) threads[i] = CreateThread(NULL 0 ThreadProcCallback
(void)ampa[i] 0 NULL)WaitForMultipleObjects(numThreads threads TRUE INFINITE)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (ms) lfn 10000 time_spancount())for (size_t i = 0 i lt numThreads ++i)
printf(sum[llu] = lun i a[i]sum)delete[] adelete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFOREMany refills from
CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTERFew refills from CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
76
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
77
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
78
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi8]rax
add rbp40h
inc rdi
cmp rdir14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi8]rax
add rbp4
inc rdi
cmp rdir14
jb 0000000140001180
ROADMAP
79
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201980
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
bull New Zen2 core microarchitecturebull New 7nm process technologybull Higher instructions-per-cycle (IPC)bull Doubled floating point width to 256-bitbull Doubled loadstore bandwidthbull PCIereg Gen4 Readybull AM4 Desktop Infrastructurebull See httpswwwamdcomeneventscesbull See httpswwwamdcomeneventsnext-horizon
3RD GENERATION AMD RYZENtrade DESKTOP PROCESSOR
QUESTIONS
82
GIVEAWAY
THANK YOU
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
KENNETH MITCHELLISV GAME ENGINEERING RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
85
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201986
The information presented in this document is for informational purposes only and may contain technical inaccuracies omissions and typographical errors
The information contained herein is subject to change and may be rendered inaccurate for many reasons including but not limited to product and roadmap changes component and motherboard version changes new model andor product releases product differences between differing manufacturers software changes BIOS flashes firmware upgrades or the like AMD assumes no obligation to update or otherwise correct or revise this information However AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT INDIRECT SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES
ATTRIBUTIONcopy 2019 Advanced Micro Devices Inc All rights reserved AMD Ryzentrade and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices Inc in the United States andor other jurisdictions Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc Microsoft and Windows are registered trademarks of Microsoft Corporation PCIe is a registered trademark of PCI-SIG Other names are for informational purposes only and may be trademarks of their respective owners
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
AVOID THE 2009 AMD PROCESSOR AND CORE ENUMERATION CODE SAMPLE
• The deprecated enum.c code sample from June 30, 2009 does not work properly on AMD family 17h processors.
  • "Number of cores per processor" and "Number of threads per processor core" return incorrect values.
• Formerly hosted at http://developer.amd.com/resources/documentation-articles/articles-whitepapers/processor-and-core-enumeration-using-cpuid
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts

Incorrect enum.exe output for Ryzen™ 7 2700X:
    Physical Processor ID 0 has 16 cores as logical processors: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
    Number of active logical processors: 16
    Number of active physical processors: 1
    Number of cores per processor: 16
    Number of threads per processor core: 1
    --------------------------------------------------------------------------------
Correct output:
    Number of cores per processor: 8
    Number of threads per processor core: 2
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
  • Otherwise the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

    DWORD lps = 64;                    // logical processors
    DWORD_PTR p = 0xffffffffffffffff;  // process mask
    int b = 32;                        // bit index
    #if 0 // BAD
    int x = p;                         // signed narrowing
    int count = 0;
    while (x != 0) {                   // never zero
    // while (x > 0) {                 // false
        count += (x & 1);
        x >>= 1;                       // fill bits with sign
    }
    printf("popcnt = %i\n", count);                       // 0
    printf("mask b[%i] = 0x%016llx\n", b, (1 << b));      // undefined, 0
    #else // GOOD
    DWORD_PTR x = p;
    int count = 0;
    while (x != 0) {
        count += (x & 1);
        x >>= 1;
    }
    printf("popcnt = %i\n", count);
    printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));   // 0x0000000100000000
    #endif
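A portable sketch of the GOOD path, using standard C++ types in place of the Windows DWORD_PTR. The names count_set_bits and bit_mask are illustrative and do not come from the slides. Keeping the mask in an unsigned 64-bit type means the right shift fills with zeros, so the loop terminates, and shifting 1ULL keeps bit indices of 32 and above well defined.

```cpp
#include <cstdint>

// Count set bits in a 64-bit affinity-style mask.
// The mask stays unsigned and 64 bits wide, so >> fills with zeros.
inline int count_set_bits(std::uint64_t mask) {
    int count = 0;
    while (mask != 0) {
        count += static_cast<int>(mask & 1u);
        mask >>= 1;
    }
    return count;
}

// Build a single-bit mask for logical processor index b (0..63).
// 1ULL is 64 bits wide, so indices of 32 and above are well defined.
inline std::uint64_t bit_mask(unsigned b) {
    return 1ULL << b;
}
```

With a full 64-processor mask, count_set_bits returns 64, and bit_mask(32) yields the 0x0000000100000000 value noted in the GOOD branch.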
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
  • Otherwise the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\", poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\", poi(r8)"

    // char buffer[0x1000];  // bad assumption
    char* buffer = NULL;
    DWORD len = 0;
    if (FALSE == GetLogicalProcessorInformationEx(
            RelationAll,
            (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer, &len)) {
        if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
            printf("len = 0x%x\n", len);
            buffer = (char*)malloc(len);
            if (GetLogicalProcessorInformationEx(
                    RelationAll,
                    (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                    &len)) {
                // ...
            }
            free(buffer);
        }
    }
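The Windows call is not portable, so the sketch below uses std::snprintf as a stand-in. It follows the same contract: call once with no buffer to learn the required length, then allocate exactly that much. The function name format_processor_count and the format string are hypothetical, chosen only to illustrate the query-then-allocate pattern.

```cpp
#include <cstdio>
#include <string>
#include <vector>

// Two-call pattern: ask for the required size first, then allocate exactly that.
// std::snprintf stands in for GetLogicalProcessorInformationEx here (assumption:
// only the size-query contract is being illustrated, not the Windows API).
inline std::string format_processor_count(int logical_processors) {
    const char* fmt = "logical processors: %d";
    // First call: null buffer, zero size. Returns the length needed (excluding '\0').
    int needed = std::snprintf(nullptr, 0, fmt, logical_processors);
    if (needed < 0) return std::string();
    // Second call: buffer sized at runtime from the first call's answer.
    std::vector<char> buffer(static_cast<std::size_t>(needed) + 1);
    std::snprintf(buffer.data(), buffer.size(), fmt, logical_processors);
    return std::string(buffer.data(), static_cast<std::size_t>(needed));
}
```

The point carried over from the slide is that no length is fixed at compile time, so a system with more logical processors than the developer expected cannot overflow the buffer.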
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
  • The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
  • See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). x-axis: NumContexts (0 to 16); y-axis: CPU time difference from NumContexts 1 (-50 to 400); series: Draws1025, Draws10250.]
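The recommendation can be sketched as a small helper, reading it as min(cores-1, Draws/250). The function name recommended_num_contexts is hypothetical, and clamping the result to at least 1 is an added assumption that the slides do not state.

```cpp
#include <algorithm>

// Recommended NumContexts = min(cores - 1, draws / 250), per the slide.
// Clamping the result to at least 1 is an added assumption so that small
// core or draw counts still yield a usable context count.
inline int recommended_num_contexts(int cores, int draws) {
    int by_cores = cores - 1;
    int by_draws = draws / 250;   // integer division
    return std::max(1, std::min(by_cores, by_draws));
}
```

For the 8-core test system, the two draw counts in the chart give 4 contexts at 1025 draws (limited by draw count) and 7 contexts at 10250 draws (limited by core count).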
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or _declspec(align(64))
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {  // lock xchg cmp
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }
    alignas(64) MyLock::LOCK gLock;
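The same best practices can be sketched with std::atomic instead of the MSVC intrinsics. This is an illustration under stated assumptions, not the code from the slides: the SpinLock type and SPIN_PAUSE macro are hypothetical names, the pause hint is only emitted on x86-64 builds, and unlock uses a release store rather than the slide's _InterlockedExchange.

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() ((void)0)   // assumption: no pause hint on other targets
#endif

// Test and test-and-set spin lock, aligned to its own 64-byte cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};   // 0 = free, 1 = taken

    bool try_lock() {
        // Plain load first: waiters read the line without writing to it.
        if (state.load(std::memory_order_relaxed) != 0) return false;
        // Exchange only when the lock looked free.
        return state.exchange(1, std::memory_order_acquire) == 0;
    }
    void lock() {
        while (!try_lock()) SPIN_PAUSE();
    }
    void unlock() {
        state.store(0, std::memory_order_release);
    }
};
```

A caller would wrap the critical section between lock() and unlock(). The read before the exchange is the "test" half of test and test-and-set, which keeps waiting threads from repeatedly taking exclusive ownership of the lock's cache line.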
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) by binary: Before 304, After 27 (+1018%).]
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
    00000001400010D9
    pause
    jmp   00000001400010C0
    00000001400010DD
BEFORE (BAD)
    00000001400010C1
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg

Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
  • Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
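The length condition can be written down as a predicate, which may help when auditing call sites or choosing benchmark sizes. This is a minimal sketch: in_affected_range and sweep_sizes are hypothetical helper names, and the 16-byte step in the sweep is an assumption chosen for illustration. The "unknown at compile time" part of the condition cannot be checked at runtime and is not modeled here.

```cpp
#include <cstddef>
#include <vector>

// True when a runtime-sized copy falls in the range the slide reports as
// affected: length > 32 && length <= 128 bytes.
inline bool in_affected_range(std::size_t length) {
    return length > 32 && length <= 128;
}

// Sizes worth benchmarking: every 16 bytes from 0 to max_size inclusive
// (the 16-byte step is an assumption chosen for illustration).
inline std::vector<std::size_t> sweep_sizes(std::size_t max_size) {
    std::vector<std::size_t> sizes;
    for (std::size_t s = 0; s <= max_size; s += 16) sizes.push_back(s);
    return sizes;
}
```

A benchmark could iterate over sweep_sizes(160) and flag the sizes for which in_affected_range is true, so results inside and outside the window can be compared directly.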
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\" to the application folder.

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). x-axis: size (0 to 160); y-axis: Elapsed Time (s) (0 to 35); series: before, after.]
CODE SAMPLE
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, rand, srand
    #include <cstring>  // memcpy, memset
    #include <numeric>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Use < rather than <= so the loop does not write one byte past the end of a.
        for (size_t i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
    memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …
BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
    memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
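One way to keep a single non-temporal stream active at a time is loop fission: finish one array before starting the next. The sketch below is an assumption-laden illustration and is not the workaround the slides apply (which replaces one of the two streaming stores with a regular store). The names step_one_array and step_split are hypothetical, the record layout {x,y,z,w,vx,vy,vz,vw} is assumed, and a scalar fallback is included for non-x86-64 builds. Whether this helps in a given workload should be checked against the WcbFull PTI counter.

```cpp
#include <cstddef>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define HAVE_SSE_STREAM 1
#endif

// Advance interleaved {x,y,z,w,vx,vy,vz,vw} records: position += velocity * dt.
// `data` must be 16-byte aligned and `len` a multiple of 8.
inline void step_one_array(float* data, std::size_t len, float dt) {
#ifdef HAVE_SSE_STREAM
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < len; i += 8) {
        __m128 p = _mm_load_ps(data + i);
        __m128 v = _mm_load_ps(data + i + 4);
        // The only non-temporal stream active in this loop.
        _mm_stream_ps(data + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    _mm_sfence();   // make the streamed stores globally visible
#else
    for (std::size_t i = 0; i < len; i += 8)
        for (std::size_t k = 0; k < 4; ++k)
            data[i + k] += data[i + 4 + k] * dt;
#endif
}

// Loop fission: finish array a before starting array b, so each pass
// writes a single non-temporal stream instead of interleaving two.
inline void step_split(float* a, float* b, std::size_t len, float dt) {
    step_one_array(a, len, dt);
    step_one_array(b, len, dt);
}
```

The trade-off is a second pass over the loop counter and the loss of any reuse between the two arrays, so this form is only a candidate to measure, not a guaranteed improvement.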
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%).]
CODE SAMPLE
    #include <intrin.h>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, atof, rand, srand
    #include <numeric>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
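The "align and pad" advice can be checked at compile time. The sketch below is illustrative: PaddedSum, kCacheLine, and element_stride are hypothetical names, and the 64-byte line size is taken from the slides. The static_asserts fail the build if the structure ever stops occupying exactly one cache line.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;   // line size stated on the slides

// Per-thread data padded and aligned to a full cache line, so neighbouring
// array elements never share a line.
struct alignas(kCacheLine) PaddedSum {
    unsigned long sum = 0;
};

static_assert(sizeof(PaddedSum) == kCacheLine, "one element per cache line");
static_assert(alignof(PaddedSum) == kCacheLine, "element starts on a line boundary");

// Byte distance between consecutive elements of an array of T.
template <typename T>
std::size_t element_stride(const T* array) {
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(array + 1) -
        reinterpret_cast<const char*>(array));
}
```

With the padded type, each thread's element sits 64 bytes from its neighbour, whereas an array of bare unsigned long packs several threads' counters into one line.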
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: Avoid False Sharing. Time (s) by binary: Before 365, After 47 (+679%).]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, rand, srand
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX.
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX.
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
BEFORE (BAD)
    0000000140001180
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New "Zen 2" core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon

QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD Ryzen™ and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
AVOID SIGNED NARROWING AFFINITY MASKS
• Avoid signed narrowing affinity masks.
• Otherwise, the application may crash or exhibit unexpected behavior.
• By default, an application is constrained to a single group, a static set of up to 64 logical processors.
• Right Shifts: For signed numbers, the sign bit is used to fill the vacated bit positions.
• Left Shifts: If you left-shift a signed number so that the sign bit is affected, the result is undefined.
• See https://msdn.microsoft.com/en-us/library/336xbhcz.aspx

DWORD lps = 64;                       // logical processors
DWORD_PTR p = 0xffffffffffffffff;     // process mask
int b = 32;                           // bit index
#if 0 // BAD
int x = p;                            // signed narrowing
int count = 0;
// while (x != 0)                     // never zero
while (x > 0) {                       // false
    count += (x & 1);
    x >>= 1;                          // fill bits with sign
}
printf("popcnt = %i\n", count);                        // 0
printf("mask b[%i] = 0x%016llx\n", b, (1 << b));       // undefined 0
#else // GOOD
DWORD_PTR x = p;
int count = 0;
while (x != 0) {
    count += (x & 1);
    x >>= 1;
}
printf("popcnt = %i\n", count);
printf("mask b[%i] = 0x%016llx\n", b, (1ULL << b));    // 0x0000000100000000
#endif
GET GETLOGICALPROCESSORINFORMATIONEX BUFFER LENGTH AT RUNTIME
• Get the PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime.
• Otherwise, the application may crash if an insufficiently sized buffer was created at compile time.
• See https://github.com/GPUOpen-LibrariesAndSDKs/cpu-core-counts/blob/master/windows/ThreadCount-Win7.cpp
• WinDbg Commands
  • bp kernelbase!GetLogicalProcessorInformation ".printf \"FOUND GetLogicalProcessorInformation Buffer Length 0x%x\\n\",poi(rdx)"
  • bp kernelbase!GetLogicalProcessorInformationEx ".printf \"FOUND GetLogicalProcessorInformationEx Buffer Length 0x%x\\n\",poi(r8)"

// char buffer[0x1000]; // bad assumption
char* buffer = NULL;
DWORD len = 0;
if (FALSE == GetLogicalProcessorInformationEx(
        RelationAll,
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
        &len)) {
    if (GetLastError() == ERROR_INSUFFICIENT_BUFFER) {
        printf("len = 0x%x\n", len);
        buffer = (char*)malloc(len);
        if (GetLogicalProcessorInformationEx(
                RelationAll,
                (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer,
                &len)) {
            // ...
        }
        free(buffer);
    }
}
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release.
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing.
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim.
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing affinity or insufficient buffer.
• The updated shim in Windows® 10 version 19H1 hooks processor-related APIs such as GetSystemInfo and GetLogicalProcessorInformation.
• See the Compatibility Administrator.
• Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub.
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250).
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 12, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16); Y axis: cpu time difference from NumContexts 1; series: Draws 1025 and Draws 10250.]
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                           Recommended Value
r.rhicmdusedeferredcontexts       1
r.rhicmduseparallelalgorithms     1
r.rhithread.enable                1
r.rhicmdbypass                    0
USE BEST PRACTICES WITH SPINLOCKS
SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

namespace MyLock {
typedef unsigned LOCK, *PLOCK;
enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD
void Lock(PLOCK pl) {
    while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
               pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) { // lock cmpxchg
    }
}
#else // GOOD
void Lock(PLOCK pl) {
    while ((LOCK_IS_TAKEN == *pl) ||
           (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
        _mm_pause();
    }
}
#endif
void Unlock(PLOCK pl) {
    _InterlockedExchange(pl, LOCK_IS_FREE);
}
}
alignas(64) MyLock::LOCK gLock;
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 30, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Use Best Practices With Spinlocks (less is better). Elapsed Time (s) per binary: Before 304, After 27 (+1018%).]
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0
cmp dword ptr [?gLock@@3IA],1
je 00000001400010D9
mov eax,1
xchg eax,dword ptr [?gLock@@3IA]
cmp eax,1
jne 00000001400010DD
00000001400010D9
pause
jmp 00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
xor eax,eax
lock cmpxchg dword ptr [?gLock@@3IA],ecx
cmp eax,ecx
je 00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg
Preferred: instructions without lock prefix (_InterlockedExchange, _InterlockedExchangePointer).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow [highlighting not preserved in this text]
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder.

memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG    ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]          ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                    ; is this a small set, size <= 128?
    ja      XmmSet                     ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG    ; check if string set should be used
    jnc     XmmSetSmall                ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). X axis: size (0 to 160); Y axis: Elapsed Time (s) (0 to 35); series: before, after.]
CODE SAMPLE
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, rand, srand
#include <cstring>  // memcpy, memset
#include <numeric>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        memcpy(b, a, size);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (size_t i = 0; i < sizeof(a); ++i) {  // stay inside the array bounds
        a[i] = rand() % 256;
    }
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
calls memcpy within executable with workaround
memcpy:
mov r11,rcx
mov r10,rdx
cmp r8,10h
jbe 00000001400012F0
cmp r8,20h
jbe 00000001400012D0
sub rdx,rcx
jae 00000001400012A4
lea rax,[r8+r10]
cmp rcx,rax
jb 00000001400015D0
cmp r8,80h
jbe 0000000140001510
…
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140.dll!memcpy
memcpy:
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) per binary: Before 104, After 47 (+123%).]
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x, y, z, w, vx, vy, vz, vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        step(dt);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movups xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+42E40h]
movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
movups xmm0,xmmword ptr [rsi+rcx*4+4640h]
mulps xmm0,xmm1
addps xmm0,xmmword ptr [rsi+r8*4+4640h]
movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. Time (s) per binary: Before 365, After 47 (+679%).]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                  (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
[Profile table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile table columns: same as above]
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi*8],rax
add rbp,40h
inc rdi
cmp rdi,r14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi*8],rax
add rbp,4
inc rdi
cmp rdi,r14
jb 0000000140001180
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
char buffer[0x1000] bad assumption char buffer = NULLDWORD len = 0if (FALSE == GetLogicalProcessorInformationEx(
RelationAll (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buffer amplen)) if (GetLastError() == ERROR_INSUFFICIENT_BUFFER)
printf(len = 0xxn len)buffer = (char)malloc(len)if (GetLogicalProcessorInformationEx(
RelationAll(PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)bufferamplen))
free(buffer)
GET GETLOGICALPROCESSORINFORMATIONEXBUFFER LENGTH AT RUNTIMEbull Get
PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX buffer length at runtime
bull Otherwise the application may crash if an insufficiently sized buffer was created at compile time
bull See httpsgithubcomGPUOpen-LibrariesAndSDKscpu-core-countsblobmasterwindowsThreadCount-Win7cpp
bull WinDbg Commandsbull bp kernelbaseGetLogicalProcessorInformation
printf FOUND GetLogicalProcessorInformationBuffer Length 0xxnpoi(rdx)ldquo
bull bp kernelbaseGetLogicalProcessorInformationExprintf FOUND GetLogicalProcessorInformationEx Buffer Length 0xxnpoi(r8)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201941
bull It may not be practical for developers to fix application code bugs related to high processor count long after release bull Costs may include restoring source from an archive buying more servers hiring more engineers
coding testing and releasingbull Although application updates are strongly preferred AMD and the Microsoft Windows Compatibility team
collaborated on a last-resort Application Compatibility Shimbull The ProcessorCountLie shim passes limited CPU topology to the application preventing processor count
bugs in application code such as signed narrowing affinity or insufficient buffer bull The updated shim in Windowsreg 10 version 19H1 hooks processor related APIs such as GetSystemInfo
and GetLogicalProcessorInformationbull See the Compatibility Administrator
bull Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
PROCESSORCOUNTLIE SHIM
BUILD COMMANDS LISTS IN PARALLEL
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201943
bull Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw countbull See httpsgithubcomMicrosoftDirectX-Graphics-
SamplestreemasterSamplesDesktopD3D12Multithreadingbull Recommend NumContexts = min(cores-1 Draws250) bull Performance of binary compiled with Microsoftreg Visual Studio
2017 v1596bull Testing done by Kenneth Mitchell February 12 2019 on the
following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 (driver 1921 Feb4) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
-500
50100150200250300350400
0 4 8 12 16
cpu
time
di
ffere
nce
from
N
umC
onte
xts
1
NumContexts
D3D12MultithreadingTitleThrottle=10000
(higher is better)Draws1025 Draws10250
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201944
bull Be sure to re-enable parallel rendering after using debug features before you shipbull See httpsdocsunrealenginecomen-USProgrammingRenderingParallelRendering
USE UE4 PARALLEL RENDERING
Command Recommended Valuerrhicmdusedeferredcontexts 1rrhicmduseparallelalgorithms 1rrhithreadenable 1rrhicmdbypass 0
USE BEST PRACTICES WITH SPINLOCKS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Spin Lock Best Practices
bull avoid lock prefix instructionsbull use the pause instructionbull test and test-and-setbull alignas(64) lock variable
bull or _declspec(align(64))bull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull ALUTokenStall PTI
bull gt= 3K Per Thousand Instructions is bad for top functions
46
namespace MyLock typedef unsigned LOCK PLOCKenum LOCK_IS_FREE = 0 LOCK_IS_TAKEN = 1
if 0 BAD void Lock(PLOCK pl)
while (LOCK_IS_TAKEN == _InterlockedCompareExchange( pl LOCK_IS_TAKEN LOCK_IS_FREE)) lock xchg cmp
else GOOD
void Lock(PLOCK pl) while ((LOCK_IS_TAKEN == pl) ||
(LOCK_IS_TAKEN==_InterlockedExchange(pl LOCK_IS_TAKEN)))_mm_pause()
endifvoid Unlock(PLOCK pl)
_InterlockedExchange(pl LOCK_IS_FREE)
alignas(64) MyLockLOCK gLock
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 30
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
47
Before AfterTime (s) 304 27
+10180
50
100
150
200
250
300
350
Elap
sed
Tim
e (s
)
binary
Use Best Practices With Spinlocks(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm>  // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        // LEN independent 4x4 matrix multiply-accumulates
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions
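
The profiling slides quote event counts as PTI, that is, per thousand instructions. As a minimal sketch of that normalization (the function name and the use of raw counter totals as inputs are assumptions for illustration, not part of AMD uProf):

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: normalizes a raw event count to events per thousand
// retired instructions (PTI), the unit quoted on the profiling slides.
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    assert(retired_instructions > 0);  // a sample with no instructions has no PTI
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}
```

With this normalization, 4000 stall events observed over 1000 retired instructions correspond to the "~4K stalls per 1000 instructions" shown for the profile before the change.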
DISASSEMBLY SNIPPET

AFTER (GOOD)
00000001400010C0:
  cmp   dword ptr [?gLock@@3IA], 1
  je    00000001400010D9
  mov   eax, 1
  xchg  eax, dword ptr [?gLock@@3IA]
  cmp   eax, 1
  jne   00000001400010DD
00000001400010D9:
  pause
  jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
  xor   eax, eax
  lock cmpxchg dword ptr [?gLock@@3IA], ecx
  cmp   eax, ecx
  je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic                              generates instructions including
_InterlockedAnd                        lock cmpxchg
_interlockedbittestandreset            lock btr
_interlockedbittestandset              lock bts
_InterlockedCompareExchange            lock cmpxchg
_InterlockedCompareExchange128         lock cmpxchg16b
_InterlockedCompareExchangePointer     lock cmpxchg
_InterlockedDecrement                  lock dec
_InterlockedExchangeAdd                lock add
_InterlockedIncrement                  lock inc
_InterlockedOr                         lock cmpxchg
_InterlockedXor                        lock cmpxchg
_InterlockedExchange                   xchg
_InterlockedExchangePointer            xchg

Preferred: instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
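
The spinlock best practices in this part of the talk (test and test-and-set, the pause instruction, a lock variable aligned to 64 bytes) can also be sketched in portable C++. The version below is an illustration under assumptions, not the presenter's code: it uses `std::atomic` instead of the MSVC `_Interlocked*` intrinsics, releases the lock with a plain release store rather than an exchange, and the names `SpinLock`, `cpu_relax`, and `contended_sum` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
static inline void cpu_relax() { _mm_pause(); }  // pause hint while spinning
#else
static inline void cpu_relax() { std::this_thread::yield(); }  // fallback on non-x86 targets
#endif

// One lock per 64-byte cache line so unrelated data never shares the line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiters do not issue atomic writes.
            while (state.load(std::memory_order_relaxed) == 1)
                cpu_relax();
            // Test-and-set: attempt the exchange only once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0)
                return;
        }
    }

    void unlock() {
        assert(state.load(std::memory_order_relaxed) == 1);  // must be held
        state.store(0, std::memory_order_release);
    }
};
static_assert(alignof(SpinLock) == 64, "lock should own a full cache line");

// Runs `threads` workers that each increment a shared counter `iters` times
// while holding the lock, and returns the final counter value.
inline unsigned long long contended_sum(unsigned threads, unsigned iters) {
    SpinLock lock;
    unsigned long long total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&lock, &total, iters] {
            for (unsigned i = 0; i < iters; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return total;
}
```

Calling `contended_sum(4, 10000)` exercises the lock from four threads; the counter only reaches the full total if mutual exclusion holds.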
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently, “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time
• Meanwhile, we propose a workaround
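
To make the stated condition concrete, here is a small sketch of the length range named in the preface. Both helper names are hypothetical, and the code only classifies a length; it does not detect the processor family or the runtime version.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical predicate for the range the preface describes:
// length > 32 && length <= 128 bytes.
constexpr bool in_reported_range(std::size_t length) {
    return length > 32 && length <= 128;
}

// Copies `length` bytes where the length is only known at run time, which is
// the other half of the stated condition. Returns whether the length falls in
// the reported range, which can help when deciding what sizes to benchmark.
inline bool copy_runtime_length(char* dst, const char* src, std::size_t length) {
    assert(dst != nullptr && src != nullptr);
    std::memcpy(dst, src, length);
    return in_reported_range(length);
}
```

A benchmark that sweeps sizes, like the one in the code sample for this topic, could use such a predicate to label which measurements lie inside the reported range.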
WORKAROUND
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder

memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]         ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                   ; is this a small set, size <= 128?
    ja      XmmSet                    ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Chart: call memcpy benchmark (elapsed time in seconds versus size, 0 to 160, less is better; series: before, after)
CODE SAMPLE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    // size is only known at run time, so this is a call into the runtime's memcpy
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // i < sizeof(a): the slide's "<=" would write one byte past the end of a
    for (size_t i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND): calls memcpy within executable with workaround
memcpy:
  mov   r11, rcx
  mov   r10, rdx
  cmp   r8, 10h
  jbe   00000001400012F0
  cmp   r8, 20h
  jbe   00000001400012D0
  sub   rdx, rcx
  jae   00000001400012A4
  lea   rax, [r8+r10]
  cmp   rcx, rax
  jb    00000001400015D0
  cmp   r8, 80h
  jbe   0000000140001510
  …

BEFORE (WITHOUT WORKAROUND): calls vcruntime140.dll!memcpy
memcpy:
  jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
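
As a minimal sketch of the "one stream per hardware thread" advice, the loop below updates two arrays but writes only one of them with non-temporal stores; the other uses regular cached stores. The array names, sizes, and the `add_delta` function are hypothetical, and the scalar branch exists only so the sketch still builds on non-x86 targets.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__x86_64__) || defined(_M_X64)
#include <xmmintrin.h>
#endif

constexpr std::size_t kLen = 64;
static_assert(kLen % 4 == 0, "loop processes four floats per step");
alignas(64) float pos_a[kLen];
alignas(64) float pos_b[kLen];

// Adds `delta` to every element of both arrays. Only pos_a is written through
// a write-combining (non-temporal) stream; pos_b goes through the caches.
inline void add_delta(float delta) {
    assert(reinterpret_cast<std::uintptr_t>(pos_a) % 16 == 0);
#if defined(__x86_64__) || defined(_M_X64)
    const __m128 d = _mm_set1_ps(delta);
    for (std::size_t i = 0; i < kLen; i += 4) {
        _mm_stream_ps(&pos_a[i], _mm_add_ps(_mm_load_ps(&pos_a[i]), d));  // the single stream
        _mm_store_ps(&pos_b[i], _mm_add_ps(_mm_load_ps(&pos_b[i]), d));   // regular store
    }
    _mm_sfence();  // order the streamed stores before later stores
#else
    for (std::size_t i = 0; i < kLen; ++i) {
        pos_a[i] += delta;
        pos_b[i] += delta;
    }
#endif
}
```

Which of the two arrays keeps the non-temporal store is a judgment call; a reasonable choice is the one whose data is least likely to be read again soon.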
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Chart: Avoid Too Many Non-Temporal Streams (elapsed time in seconds per binary, less is better)
Before: 104 s; After: 47 s (+123%)
CODE SAMPLE
#include <intrin.h>
#include <stdio.h>
#include <stdlib.h>
#include <numeric>
#include <chrono>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
  mulps   xmm0, xmm1
  addps   xmm0, xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h], xmm0
  movups  xmm0, xmmword ptr [rsi+rcx*4+4640h]
  mulps   xmm0, xmm1
  addps   xmm0, xmmword ptr [rsi+r8*4+4640h]
  movups  xmmword ptr [rsi+r8*4+4640h], xmm0

BEFORE (WITHOUT WORKAROUND)
  mulps   xmm0, xmm1
  addps   xmm0, xmmword ptr [rsi+r8*4+42E40h]
  movntps xmmword ptr [rsi+r8*4+42E40h], xmm0
  movups  xmm0, xmmword ptr [rsi+rcx*4+4640h]
  mulps   xmm0, xmm1
  addps   xmm0, xmmword ptr [rsi+r8*4+4640h]
  movntps xmmword ptr [rsi+r8*4+4640h], xmm0
AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters – especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
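
The "align and pad thread parameters" advice can be sketched with standard C++ threads. This is an illustration under assumptions rather than the talk's Windows sample: `PaddedCounter` and `count_in_parallel` are hypothetical names, and the sketch assumes C++17 so that `std::vector` honors the 64-byte alignment of its elements.

```cpp
#include <cassert>
#include <thread>
#include <vector>

// Each per-thread slot is aligned and padded to a 64-byte cache line, so two
// threads never write to the same line.
struct alignas(64) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "one counter per cache line");

// Starts `threads` workers; worker t adds (i % 2) for i in [0, iters) into its
// own padded slot. Returns the per-thread sums.
inline std::vector<unsigned long> count_in_parallel(unsigned threads,
                                                    unsigned long iters) {
    assert(threads > 0);
    std::vector<PaddedCounter> counters(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&counters, t, iters] {
            for (unsigned long i = 0; i < iters; ++i)
                counters[t].sum += i % 2;
        });
    for (auto& th : pool) th.join();
    std::vector<unsigned long> sums;
    for (const auto& c : counters) sums.push_back(c.sum);
    return sums;
}
```

Removing `alignas(64)` packs the counters next to each other, which is the layout the talk's "before" binary uses; the results stay the same, only the cache-line traffic changes.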
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Chart: Avoid False Sharing (time in seconds per binary)
Before: 365 s; After: 47 s (+679%)
CODE SAMPLE
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: neighboring ThreadData share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: one ThreadData per cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX

PROFILING AFTER
Few refills from CCX

Columns shown in both profiles: CPU clocks; Ret inst; IPC; Data Cache Accesses PTI; Demand Data Cache Miss; Data Cache Miss; Data Cache Refills DRAM PTI; Data Cache Refills CCX PTI; Data Cache Refills L2 PTI
DISASSEMBLY SNIPPET

AFTER (GOOD)
0000000140001180:
  mov   qword ptr [rsp+28h], rsi
  lea   r8, [?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9, rbp
  mov   dword ptr [rsp+20h], esi
  xor   edx, edx
  xor   ecx, ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r15+rdi*8], rax
  add   rbp, 40h
  inc   rdi
  cmp   rdi, r14
  jb    0000000140001180

BEFORE (BAD)
0000000140001180:
  mov   qword ptr [rsp+28h], rsi
  lea   r8, [?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9, rbp
  mov   dword ptr [rsp+20h], esi
  xor   edx, edx
  xor   ecx, ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r12+rdi*8], rax
  add   rbp, 4
  inc   rdi
  cmp   rdi, r14
  jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon

QUESTIONS

GIVEAWAY

THANK YOU

KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
PROCESSORCOUNTLIE SHIM
• It may not be practical for developers to fix application code bugs related to high processor count long after release
  • Costs may include restoring source from an archive, buying more servers, hiring more engineers, coding, testing, and releasing
• Although application updates are strongly preferred, AMD and the Microsoft Windows Compatibility team collaborated on a last-resort Application Compatibility Shim
• The ProcessorCountLie shim passes limited CPU topology to the application, preventing processor count bugs in application code such as signed narrowing, affinity, or insufficient buffer
• The updated shim in Windows® 10 version 19H1 hooks processor related APIs such as GetSystemInfo and GetLogicalProcessorInformation
• See the Compatibility Administrator
  • Contact your Microsoft representative or report a problem using the Microsoft Feedback Hub
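
To illustrate two of the bug classes the slide names (signed narrowing and affinity), here is a hypothetical sketch. Neither function comes from the talk or from any Windows API: `narrowed_count` shows what happens when a processor count is squeezed into a signed 8-bit field, and `affinity_bit` shows a mask builder that stays within a 64-bit mask instead of shifting out of range.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bug: a logical processor count stored in a signed 8-bit field.
// Counts of 128 or more wrap to negative values on common compilers.
inline int narrowed_count(unsigned logical_processors) {
    return static_cast<std::int8_t>(logical_processors);
}

// Hypothetical fix for the affinity case: build the bit in a 64-bit type and
// reject indices that do not fit, rather than shifting a 32-bit constant.
inline std::uint64_t affinity_bit(unsigned index) {
    return index < 64 ? (std::uint64_t{1} << index) : 0;
}
```

A count of 64 survives the narrowing, while 128 comes back as -128, which is the kind of result that breaks later loops or allocations sized from it.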
BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

Chart: D3D12Multithreading, TitleThrottle=10000 (CPU time difference from NumContexts 1 versus NumContexts, 0 to 16, higher is better; series: Draws 1025, Draws 10250)
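
The recommendation NumContexts = min(cores-1, Draws/250) can be written as a small helper. This is a sketch under assumptions: the function name is hypothetical, the division is integer division, and the clamp to at least one context is added here so that small inputs still yield a usable value; the slide does not state either detail.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper for the slide's rule of thumb:
// NumContexts = min(cores - 1, Draws / 250), clamped to at least 1.
inline unsigned recommended_num_contexts(unsigned cores, unsigned draws) {
    const unsigned by_cores = cores > 1 ? cores - 1 : 1;   // leave one core free
    const unsigned by_draws = std::max(1u, draws / 250);   // about 250 draws per context
    const unsigned result = std::min(by_cores, by_draws);
    assert(result >= 1);
    return result;
}
```

For the two draw counts in the chart on an 8-core processor, this gives 4 contexts for 1025 draws and 7 contexts for 10250 draws.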
USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features before you ship
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions
namespace MyLock {
typedef unsigned LOCK, *PLOCK;
enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

#if 0 // BAD
void Lock(PLOCK pl) {
    while (LOCK_IS_TAKEN == _InterlockedCompareExchange(pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
    } // lock xchg cmp
}
#else // GOOD
void Lock(PLOCK pl) {
    while ((LOCK_IS_TAKEN == *pl) ||
           (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN)))
        _mm_pause();
}
#endif
void Unlock(PLOCK pl) {
    _InterlockedExchange(pl, LOCK_IS_FREE);
}
}
alignas(64) MyLock::LOCK gLock;
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 30
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
47
Before AfterTime (s) 304 27
+10180
50
100
150
200
250
300
350
Elap
sed
Tim
e (s
)
binary
Use Best Practices With Spinlocks(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude intrinhinclude stdiohinclude windowshinclude ltchronogtinclude ltnumericgtinclude ltthreadgtdefine LEN 512
alignas(64) float b[LEN][4][4]alignas(64) float c[LEN][4][4]
DWORD WINAPI ThreadProcCallback(LPVOID data) MyLockLock(ampgLock) alignas(64) float a[LEN][4][4]stdfill((float)a (float)(a + LEN) 00f)float r = 00for (size_t iter = 0 iter lt 100000 iter++)
for (int m = 0 m lt LEN m++)for (int i = 0 i lt 4 i++)
for (int j = 0 j lt 4 j++)for (int k = 0 k lt 4 k++)
a[m][i][j] += b[m][i][k] c[m][k][j]r += stdaccumulate((float)a
(float)(a + LEN) 00f) printf(result fn r) MyLockUnlock(ampgLock)return 0
48
int main(int argc char argv[]) using namespace stdchronofloat b0 = (argc gt 1) strtof(argv[1] NULL) 10ffloat c0 = (argc gt 2) strtof(argv[2] NULL) 20f stdfill((float)b (float)(b + LEN) b0)stdfill((float)c (float)(c + LEN) c0)int num_threads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[num_threads]high_resolution_clocktime_point t0 =
high_resolution_clocknow()
for (size_t i = 0 i lt num_threads ++i) threads[i] = CreateThread(NULL
0 ThreadProcCallback NULL 0 NULL) WaitForMultipleObjects(num_threads
threads TRUE INFINITE)
high_resolution_clocktime_point t1 = high_resolution_clocknow()
durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (milliseconds) lfn 10000 time_spancount())
delete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~4K stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~0 stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
51
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
52
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
53
AFTER (GOOD)
00000001400010C0
cmp dword ptr [gLock3IA]1
je 00000001400010D9
mov eax1
xchg eaxdword ptr [gLock3IA]
cmp eax1
jne 00000001400010DD
00000001400010D9
pause
jmp 00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
xor eaxeax
lock cmpxchg dword ptr [gLock3IA]ecx
cmp eaxecx
je 00000001400010C1
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
INTERLOCKED INTRINSIC FUNCTIONSintrinsic generates instructions including_InterlockedAnd lock cmpxchg_interlockedbittestandreset lock btr_interlockedbittestandset lock bts_InterlockedCompareExchange lock cmpxchg_InterlockedCompareExchange128 lock cmpxchg16b_InterlockedCompareExchangePointer lock cmpxchg_InterlockedDecrement lock dec_InterlockedExchangeAdd lock add_InterlockedIncrement lock inc_InterlockedOr lock cmpxchg_InterlockedXor lock cmpxchg_InterlockedExchange xchg_InterlockedExchangePointer xchg
54
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY amp MEMSETREGRESSION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201956
bull Currently ldquoCWindowsSystem32vcruntime140dllrdquo includes a memcpy amp memset regression that can affect AMD family 17h and later processors only
bull Code generated by MSVS 2015 2017 or 2019 may link to vcruntime140dllbull However we have found that this regression occurs under very specific
circumstances where length gt 32 ampamp length lt= 128 Bytes ampamp length is unknown at compile time
bull Meanwhile we propose a workaround
PREFACE
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201957
Workaroundbull Modify Assembly Files (Copying Recommended)
bull Comment out lines shown in yellowbull CProgram Files (x86)Microsoft Visual
Studio2017CommunityVCToolsMSVC141627023crtsrcx64
bull memcpyasmbull memsetasm
bull Assemble Object Filesbull ml64exe -c memcpyasm memsetasm
bull Set Linker gt Input gt Additional Dependencies = memcpyobjmemsetobjhellip
Alternate Workaround
bull Copy vcruntime140dll from ldquoMicrosoft Visual Studio2019PreviewCommon7IDEVCvcpackagesrdquo to the application folder
WORKAROUND memcpyasm line 456XmmCopySmall bt __favor __FAVOR_SMSTRG check if string copy should be used jc memcpy_repmovsmovups xmm0 [rcx + rdx] load deferred bytesadd rcx 16sub r8 16
memsetasm line 93 Check if strings should be used cmp r8 128 is this a small set size lt= 128 ja XmmSet if large set use XMM set bt __favor __FAVOR_SMSTRG check if string set should be used jnc XmmSetSmall otherwise use a 16-byte block set jmp memset_repmovs memcpyasm amp memsetasm code is Copyright (c) Microsoft
Corporation All rights reserved
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying workaround shows
higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell February 11
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
58
0
5
10
15
20
25
30
35
0 16 32 48 64 80 96 112 128 144 160
Elap
sed
Tim
e (s
)
size
call memcpy benchmark(less is better)
before after
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltnumericgtinclude ltchronogtalignas(64) char a[32 1024]alignas(64) char b[32 1024]using namespace stdchrono
void work(int size size_t steps) high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
memcpy(b a size)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())
59
int main(int argc char argv[]) int j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 7size_t steps = (argc gt 3) atoll(argv[3]) 1000000000int size = (argc gt 4) atoi(argv[4]) 48srand(seed)for (int i = 0 i lt= sizeof(a) ++i )
a[i] = rand()256memset(b 0 sizeof(b))
work(size steps)
printf(a[i] = in j a[j])printf(b[i] = in j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
60
AFTER (WITH WORKAROUND) calls memcpy within executable with workaroundmemcpymov r11rcxmov r10rdxcmp r810hjbe 00000001400012F0cmp r820hjbe 00000001400012D0sub rdxrcxjae 00000001400012A4lea rax[r8+r10]cmp rcxraxjb 00000001400015D0cmp r880hjbe 0000000140001510hellip
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140dllmemcpy
memcpy
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Write Combining lines are not stored in the caches The Write Combining Buffer writes 64 byte lines to
memory when full The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
bull Avoid interleaving multiple Write Combining streams to different addresses use only one stream per hardware thread if possible While using multiple streams the hardware may close buffers before they are completely full thus leading to reduced performance
bull Profilingbull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profile
bull WcbFull PTI bull gt= 22 Per Thousand Instructions is bad for top functionsbull gt= 35 Per Thousand Instructions is very bad for top functions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying the workaround
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 31
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
63
Before AfterTime (s) 104 47
+123
0
20
40
60
80
100
120
Elap
sed
Tim
e (s
)
binary
Avoid Too Many Non-Temporal Streams(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltintrinhgtinclude ltnumericgtinclude ltchronogtdefine LEN 64000alignas(64) float a[LEN]alignas(64) float b[LEN]
void step(float dt) for (size_t i = 0 i lt LEN i += 8)
xyzwvxvyvyvw__m128 p1 = _mm_load_ps(ampa[(i + 0) LEN])__m128 v1 = _mm_load_ps(ampa[(i + 4) LEN])p1 = _mm_add_ps(p1 _mm_mul_ps(v1 _mm_load_ps1(ampdt)))_mm_stream_ps(ampa[(i + 0) LEN] p1)__m128 p2 = _mm_load_ps(ampb[(i + 0) LEN])__m128 v2 = _mm_load_ps(ampb[(i + 4) LEN])p2 = _mm_add_ps(p2 _mm_mul_ps(v2 _mm_load_ps1(ampdt)))
if 0 without workaround _mm_stream_ps(ampb[(i + 0) LEN] p2)
else with workaround _mm_store_ps(ampb[(i + 0) LEN] p2)
endif
64
int main(int argc char argv[]) using namespace stdchronoint j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 3size_t steps = (argc gt 3) atoll(argv[3]) 2000000float dt = (argc gt 4) (float)atof(argv[4]) 0001fsrand(seed)for (int i = 0 i lt LEN ++i)
a[i] = (float)rand() RAND_MAXb[i] = (float)rand() RAND_MAX
high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
step(dt)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())printf(a[i] = fn j a[j])printf(b[i] = fn j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~23 WcbFull per 1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h], xmm0
    movups  xmm0, xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h], xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h], xmm0
    movups  xmm0, xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h], xmm0

AVOID FALSE SHARING

SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that reside in the same cache line. This can reduce performance because of the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™.
• Use thread-local rather than global or process-shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
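
The "align and pad thread parameters" advice can be sketched in portable C++ as below. This is a minimal illustration, not the code from the talk: the names `kCacheLine`, `PaddedCounter` and `count_per_thread` are hypothetical, a 64-byte cache line is assumed, and C++17 is assumed so that `std::vector` honors the over-aligned type.

```cpp
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot: alignas pads each counter to a full cache line,
// so two threads never write to the same line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Hypothetical helper: each thread increments only its own padded counter.
std::vector<std::uint64_t> count_per_thread(unsigned num_threads, std::uint64_t iterations) {
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (std::uint64_t i = 0; i < iterations; ++i) {
                counters[t].value += 1;  // touches only this thread's cache line
            }
        });
    }
    for (std::thread& w : workers) w.join();

    std::vector<std::uint64_t> result;
    for (const PaddedCounter& c : counters) result.push_back(c.value);
    return result;
}
```

Removing `alignas(kCacheLine)` packs eight counters into one cache line, which is the layout the profiling metric above is meant to expose.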

PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid False Sharing, time in seconds by binary
Binary    Time (s)
Before    365
After     47 (+679%)

CODE SAMPLE
#include <windows.h>
#include <cstdio>
#include <cstdlib>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: adjacent ThreadData elements share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: one ThreadData per cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++) {
        p->sum += rand() % 2;
    }
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i) {
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    }
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}

PROFILING BEFORE
Many refills from CCX
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI

PROFILING AFTER
Few refills from CCX
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov  qword ptr [rsp+28h], rsi
    lea  r8, [?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9, rbp
    mov  dword ptr [rsp+20h], esi
    xor  edx, edx
    xor  ecx, ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r15+rdi*8], rax
    add  rbp, 40h
    inc  rdi
    cmp  rdi, r14
    jb   0000000140001180

BEFORE (BAD)
0000000140001180:
    mov  qword ptr [rsp+28h], rsi
    lea  r8, [?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9, rbp
    mov  dword ptr [rsp+20h], esi
    xor  edx, edx
    xor  ecx, ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r12+rdi*8], rax
    add  rbp, 4
    inc  rdi
    cmp  rdi, r14
    jb   0000000140001180

ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New "Zen 2" core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon

QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken

DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.

BUILD COMMAND LISTS IN PARALLEL
• Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw count.
• See https://github.com/Microsoft/DirectX-Graphics-Samples/tree/master/Samples/Desktop/D3D12Multithreading
• Recommend NumContexts = min(cores-1, Draws/250)
• Performance of binary compiled with Microsoft® Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 12, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64 (driver 19.2.1 Feb4), AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: D3D12Multithreading, TitleThrottle=10000 (higher is better). X axis: NumContexts (0 to 16). Y axis: cpu time difference from NumContexts 1 (-50 to 400). Series: Draws 1025, Draws 10250.
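
The NumContexts recommendation can be written out as a small helper. This is only a sketch of the formula on the slide: the function name `recommended_num_contexts` is hypothetical, integer division is assumed for Draws/250, and the clamp to a minimum of one context is an added assumption that the slide does not state.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical helper for NumContexts = min(cores - 1, Draws / 250).
// Assumption: the result is clamped to at least 1 so that small draw counts
// or single-core systems still get one context.
std::size_t recommended_num_contexts(std::size_t cores, std::size_t draws) {
    const std::size_t by_cores = (cores > 1) ? cores - 1 : 1;
    const std::size_t by_draws = draws / 250;  // integer division (assumption)
    return std::max<std::size_t>(1, std::min(by_cores, by_draws));
}
```

For the 8-core test system above, this gives 4 contexts at 1025 draws and 7 contexts at 10250 draws.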

USE UE4 PARALLEL RENDERING
• Be sure to re-enable parallel rendering after using debug features, before you ship.
• See https://docs.unrealengine.com/en-US/Programming/Rendering/ParallelRendering

Command                          Recommended Value
r.rhicmdusedeferredcontexts      1
r.rhicmduseparallelalgorithms    1
r.rhithread.enable               1
r.rhicmdbypass                   0

USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • avoid lock prefix instructions
  • use the pause instruction
  • test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • ALUTokenStall PTI
      • >= 3K Per Thousand Instructions is bad for top functions

namespace MyLock {
    typedef unsigned LOCK, *PLOCK;
    enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };
#if 0 // BAD: every spin iteration issues a locked compare-exchange
    void Lock(PLOCK pl) {
        while (LOCK_IS_TAKEN == _InterlockedCompareExchange(pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
        } // lock cmpxchg, cmp
    }
#else // GOOD: test with a plain read first, then test-and-set, pausing between attempts
    void Lock(PLOCK pl) {
        while ((LOCK_IS_TAKEN == *pl) ||
               (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
            _mm_pause();
        }
    }
#endif
    void Unlock(PLOCK pl) {
        _InterlockedExchange(pl, LOCK_IS_FREE);
    }
}
alignas(64) MyLock::LOCK gLock;
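
For code that cannot rely on the MSVC intrinsics, the same test and test-and-set pattern can be sketched with `std::atomic`. This is an illustrative variant, not the lock from the talk: the class `SpinLock` and the stress helper `locked_increment` are hypothetical names, and the sketch assumes an x86/x64 target because it uses `_mm_pause`.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <immintrin.h>
#endif

// Hypothetical test and test-and-set spin lock, aligned to its own cache line.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiting threads do not issue locked operations.
            while (taken_.load(std::memory_order_relaxed)) {
                _mm_pause();
            }
            // Test-and-set: a single atomic exchange once the lock looks free.
            if (!taken_.exchange(true, std::memory_order_acquire)) {
                return;
            }
        }
    }
    void unlock() { taken_.store(false, std::memory_order_release); }

private:
    std::atomic<bool> taken_{false};
};

// Hypothetical stress helper: several threads bump one counter under the lock.
long locked_increment(unsigned num_threads, long per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &counter, per_thread] {
            for (long i = 0; i < per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```

Unlike `MyLock::Unlock`, which uses `_InterlockedExchange`, this sketch releases the lock with a release store; that is a design choice of the sketch, not something the slides prescribe.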

PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Use Best Practices With Spinlocks, elapsed time in seconds by binary (less is better)
Binary    Time (s)
Before    304
After     27 (+1018%)

CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm>
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512
alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}

PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE

SOURCE AFTER

DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp   dword ptr [?gLock@@3IA], 1
    je    00000001400010D9
    mov   eax, 1
    xchg  eax, dword ptr [?gLock@@3IA]
    cmp   eax, 1
    jne   00000001400010DD
00000001400010D9:
    pause
    jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor   eax, eax
    lock cmpxchg dword ptr [?gLock@@3IA], ecx
    cmp   eax, ecx
    je    00000001400010C1

INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg

Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.

AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently, "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.

WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\" to the application folder.

memcpy.asm line 456
XmmCopySmall:
    bt     __favor, __FAVOR_SMSTRG   ; check if string copy should be used
    jc     memcpy_repmovs
    movups xmm0, [rcx + rdx]         ; load deferred bytes
    add    rcx, 16
    sub    r8, 16

memset.asm line 93
    ; Check if strings should be used
    cmp    r8, 128                   ; is this a small set, size <= 128?
    ja     XmmSet                    ; if large set, use XMM set
    bt     __favor, __FAVOR_SMSTRG   ; check if string set should be used
    jnc    XmmSetSmall               ; otherwise, use a 16-byte block set
    jmp    memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.

PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: call memcpy benchmark (less is better). X axis: size (0 to 160). Y axis: Elapsed Time (s) (0 to 35). Series: before, after.

CODE SAMPLE
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <numeric>
#include <chrono>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        memcpy(b, a, size); // size is not known at compile time
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (int i = 0; i < sizeof(a); ++i) { // '<' rather than '<=' keeps the write inside a[]
        a[i] = rand() % 256;
    }
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}

DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
    mov  r11, rcx
    mov  r10, rdx
    cmp  r8, 10h
    jbe  00000001400012F0
    cmp  r8, 20h
    jbe  00000001400012D0
    sub  rdx, rcx
    jae  00000001400012A4
    lea  rax, [r8+r10]
    cmp  rcx, rax
    jb   00000001400015D0
    cmp  r8, 80h
    jbe  0000000140001510
    …

BEFORE (WITHOUT WORKAROUND)
; calls memcpy in vcruntime140.dll
memcpy:
    jmp  qword ptr [__imp_memcpy]

AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
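
The "one stream per hardware thread" advice can be sketched as a loop that reads from two arrays but issues non-temporal stores to a single, sequentially written destination. This is an illustration under stated assumptions rather than the sample from the talk: the function name `integrate_one_stream` and its parameters are hypothetical, all pointers are assumed to be 16-byte aligned, and `n` is assumed to be a multiple of 4.

```cpp
#include <cstddef>
#if defined(_MSC_VER)
#include <intrin.h>
#else
#include <immintrin.h>
#endif

// Hypothetical helper: out[i] = pos[i] + vel[i] * dt for n floats.
// Only `out` is written with non-temporal stores, so the loop keeps a single
// write-combining stream open instead of interleaving several.
void integrate_one_stream(float* out, const float* pos, const float* vel,
                          float dt, std::size_t n) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 p = _mm_load_ps(pos + i);
        const __m128 v = _mm_load_ps(vel + i);
        _mm_stream_ps(out + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    _mm_sfence();  // order the streamed stores before any later stores
}
```

If a second output array were needed in the same loop, the workaround described in this section would apply: write it with a regular store such as `_mm_store_ps` rather than opening a second stream.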

PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid Too Many Non-Temporal Streams, elapsed time in seconds by binary (less is better)
Binary    Time (s)
Before    104
After     47 (+123%)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201943
bull Binary compiled using NumContexts WorkerThreads shows higher performance given sufficient Draw countbull See httpsgithubcomMicrosoftDirectX-Graphics-
SamplestreemasterSamplesDesktopD3D12Multithreadingbull Recommend NumContexts = min(cores-1 Draws250) bull Performance of binary compiled with Microsoftreg Visual Studio
2017 v1596bull Testing done by Kenneth Mitchell February 12 2019 on the
following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 (driver 1921 Feb4) AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
BUILD COMMAND LISTS IN PARALLEL
-500
50100150200250300350400
0 4 8 12 16
cpu
time
di
ffere
nce
from
N
umC
onte
xts
1
NumContexts
D3D12MultithreadingTitleThrottle=10000
(higher is better)Draws1025 Draws10250
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201944
bull Be sure to re-enable parallel rendering after using debug features before you shipbull See httpsdocsunrealenginecomen-USProgrammingRenderingParallelRendering
USE UE4 PARALLEL RENDERING
Command Recommended Valuerrhicmdusedeferredcontexts 1rrhicmduseparallelalgorithms 1rrhithreadenable 1rrhicmdbypass 0
USE BEST PRACTICES WITH SPINLOCKS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Spin Lock Best Practices
bull avoid lock prefix instructionsbull use the pause instructionbull test and test-and-setbull alignas(64) lock variable
bull or _declspec(align(64))bull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull ALUTokenStall PTI
bull gt= 3K Per Thousand Instructions is bad for top functions
46
namespace MyLock typedef unsigned LOCK PLOCKenum LOCK_IS_FREE = 0 LOCK_IS_TAKEN = 1
if 0 BAD void Lock(PLOCK pl)
while (LOCK_IS_TAKEN == _InterlockedCompareExchange( pl LOCK_IS_TAKEN LOCK_IS_FREE)) lock xchg cmp
else GOOD
void Lock(PLOCK pl) while ((LOCK_IS_TAKEN == pl) ||
(LOCK_IS_TAKEN==_InterlockedExchange(pl LOCK_IS_TAKEN)))_mm_pause()
endifvoid Unlock(PLOCK pl)
_InterlockedExchange(pl LOCK_IS_FREE)
alignas(64) MyLockLOCK gLock
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 30
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
47
Before AfterTime (s) 304 27
+10180
50
100
150
200
250
300
350
Elap
sed
Tim
e (s
)
binary
Use Best Practices With Spinlocks(less is better)
CODE SAMPLE
#include "intrin.h"
#include "stdio.h"
#include "windows.h"
#include <algorithm>  // std::fill
#include <chrono>
#include <numeric>
#include <thread>
#define LEN 512

alignas(64) float b[LEN][4][4];
alignas(64) float c[LEN][4][4];

DWORD WINAPI ThreadProcCallback(LPVOID data) {
    MyLock::Lock(&gLock);
    alignas(64) float a[LEN][4][4];
    std::fill((float*)a, (float*)(a + LEN), 0.0f);
    float r = 0.0;
    for (size_t iter = 0; iter < 100000; iter++) {
        for (int m = 0; m < LEN; m++)
            for (int i = 0; i < 4; i++)
                for (int j = 0; j < 4; j++)
                    for (int k = 0; k < 4; k++)
                        a[m][i][j] += b[m][i][k] * c[m][k][j];
        r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
    }
    printf("result: %f\n", r);
    MyLock::Unlock(&gLock);
    return 0;
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
    float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
    std::fill((float*)b, (float*)(b + LEN), b0);
    std::fill((float*)c, (float*)(c + LEN), c0);
    int num_threads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[num_threads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < num_threads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
    }
    WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

[Figures: SOURCE BEFORE, SOURCE AFTER]

DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0:
    cmp   dword ptr [?gLock@@3IA],1
    je    00000001400010D9
    mov   eax,1
    xchg  eax,dword ptr [?gLock@@3IA]
    cmp   eax,1
    jne   00000001400010DD
00000001400010D9:
    pause
    jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
    xor   eax,eax
    lock cmpxchg dword ptr [?gLock@@3IA],ecx
    cmp   eax,ecx
    je    00000001400010C1
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic                              generates instructions including
_InterlockedAnd                        lock cmpxchg
_interlockedbittestandreset            lock btr
_interlockedbittestandset              lock bts
_InterlockedCompareExchange            lock cmpxchg
_InterlockedCompareExchange128         lock cmpxchg16b
_InterlockedCompareExchangePointer     lock cmpxchg
_InterlockedDecrement                  lock dec
_InterlockedExchangeAdd                lock add
_InterlockedIncrement                  lock inc
_InterlockedOr                         lock cmpxchg
_InterlockedXor                        lock cmpxchg
_InterlockedExchange                   xchg
_InterlockedExchangePointer            xchg

Preferred: the instructions without a lock prefix (_InterlockedExchange, _InterlockedExchangePointer).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
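The conditions listed above can be written down directly. The sketch below is illustrative only and is not the workaround proposed in this talk: `in_reported_range` encodes the stated length range, and `copy_fixed` shows one way to make a copy length known at compile time. Both names are hypothetical, and whether a given compiler expands the fixed-size copy inline should be confirmed in the generated code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True when a run-time copy length falls in the range reported as affected:
// length > 32 && length <= 128 bytes.
constexpr bool in_reported_range(std::size_t length) {
    return length > 32 && length <= 128;
}

// Copy with a length that is a compile-time constant. Compilers commonly
// expand such calls inline instead of calling the runtime library's memcpy,
// but that is an optimization, not a guarantee (assumption).
template <std::size_t N>
void copy_fixed(void* dst, const void* src) {
    std::memcpy(dst, src, N);
}
```

For example, `copy_fixed<48>(dst, src)` copies 48 bytes with the length visible to the compiler.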
WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

; memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG     ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]           ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

; memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                     ; is this a small set, size <= 128?
    ja      XmmSet                      ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG     ; check if string set should be used
    jnc     XmmSetSmall                 ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). Elapsed Time (s), 0 to 35, versus size, 0 to 160; series: before, after.]
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, rand, srand, EXIT_SUCCESS
#include <cstring>  // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        memcpy(b, a, size);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // Use < rather than <= so the loop does not write one byte past the end of a.
    for (size_t i = 0; i < sizeof(a); ++i) {
        a[i] = rand() % 256;
    }
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
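The benchmark chart plots elapsed time against copy size. A minimal, portable sketch of such a sweep is shown below; it is not the harness used for the published results. The names `time_memcpy`, `sweep_sizes`, `sweep_src`, and `sweep_dst` are illustrative, and the `volatile` read is an assumed way to keep the length unknown at compile time.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstring>
#include <vector>

alignas(64) char sweep_src[32 * 1024];
alignas(64) char sweep_dst[32 * 1024];

// Time `steps` copies of `size` bytes and return the elapsed seconds.
// The size is read through a volatile so it stays a run-time value.
double time_memcpy(std::size_t size, std::size_t steps) {
    volatile std::size_t runtime_size = size;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i) {
        std::memcpy(sweep_dst, sweep_src, runtime_size);
    }
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}

// Measure every size from 0 to max_size in increments of `stride` bytes.
std::vector<double> sweep_sizes(std::size_t max_size, std::size_t stride,
                                std::size_t steps) {
    std::vector<double> seconds;
    if (stride == 0) return seconds;  // avoid an endless loop
    for (std::size_t size = 0; size <= max_size; size += stride) {
        seconds.push_back(time_memcpy(size, steps));
    }
    return seconds;
}
```

Calling `sweep_sizes(160, 16, steps)` visits the same sizes as the chart's horizontal axis (0 to 160 in increments of 16).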
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
    mov   r11,rcx
    mov   r10,rdx
    cmp   r8,10h
    jbe   00000001400012F0
    cmp   r8,20h
    jbe   00000001400012D0
    sub   rdx,rcx
    jae   00000001400012A4
    lea   rax,[r8+r10]
    cmp   rcx,rax
    jb    00000001400015D0
    cmp   r8,80h
    jbe   0000000140001510
    …

BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll!memcpy
memcpy:
    jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) per binary: Before 104, After 47 (+123%).]
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, atof, rand, srand, EXIT_SUCCESS
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++) {
        step(dt);
    }
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
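The workaround in the sample keeps one non-temporal stream by turning the second store into a regular store. A different way to respect the "one stream per hardware thread" guideline, sketched here purely as an assumption and not taken from the talk, is to split the loop so each array is processed in its own pass. The function names are hypothetical, and whether this helps a given workload would need to be measured.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE__) || defined(_M_X64) || defined(_M_IX86)
#include <xmmintrin.h>
#define HAS_SSE_STREAM 1
#endif

// Advance positions by velocity * dt for one array laid out as repeating
// {x,y,z,w,vx,vy,vz,vw} groups of 8 floats. Only this array is written with
// non-temporal stores, so a single write-combining stream is active while the
// loop runs. `data` must be 16-byte aligned and `len` a multiple of 8.
void step_one_stream(float* data, std::size_t len, float dt) {
#ifdef HAS_SSE_STREAM
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < len; i += 8) {
        __m128 p = _mm_load_ps(&data[i + 0]);
        const __m128 v = _mm_load_ps(&data[i + 4]);
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));
        _mm_stream_ps(&data[i + 0], p);  // non-temporal store
    }
    _mm_sfence();  // order the streamed stores before returning
#else
    // Scalar fallback for targets without SSE (no streaming stores here).
    for (std::size_t i = 0; i < len; i += 8)
        for (std::size_t k = 0; k < 4; ++k)
            data[i + k] += data[i + 4 + k] * dt;
#endif
}

// Process the two arrays one after the other instead of interleaving them.
void step_sequential(float* a, float* b, std::size_t len, float dt) {
    step_one_stream(a, len, dt);
    step_one_stream(b, len, dt);
}
```

The trade-off is a second pass over the loop overhead in exchange for never having two streaming destinations open at once.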
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

[Figures: SOURCE BEFORE, SOURCE AFTER]

DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
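The advice to prefer thread-local data can be illustrated with a worker that accumulates into a local variable and touches the shared result only once. This is a minimal sketch under stated assumptions: the function name `accumulate_local` is hypothetical, and a small local linear congruential generator stands in for `rand()` so the worker shares no state at all.

```cpp
#include <cassert>
#include <cstdint>

// Worker body: accumulate into a local variable (typically kept in a register
// or on the thread's own stack) and write the shared slot once at the end.
void accumulate_local(std::uint32_t* out, std::uint32_t iterations,
                      std::uint32_t seed) {
    std::uint32_t state = seed;   // local pseudo-random state, not shared
    std::uint32_t local_sum = 0;  // thread-local accumulator
    for (std::uint32_t i = 0; i < iterations; ++i) {
        state = state * 1664525u + 1013904223u;  // simple LCG (illustrative)
        local_sum += (state >> 16) & 1u;         // add a pseudo-random 0 or 1
    }
    *out = local_sum;  // single write to the shared location
}
```

Each thread would call this with a pointer to its own result slot; padding those slots to 64 bytes addresses the remaining writes.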
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. Time (s) per binary: Before 365, After 47 (+679%).]
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, rand, srand, EXIT_SUCCESS
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++) {
        p->sum += rand() % 2;
    }
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i) {
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    }
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX

PROFILING AFTER
Few refills from CCX

Profile columns (both views): CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI

[Figures: SOURCE BEFORE, SOURCE AFTER]

DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180

BEFORE (BAD)
0000000140001180:
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
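The only difference between the two listings is the pointer increment per thread: 4 bytes in the unpadded case and 40h (64) bytes in the padded case. A minimal sketch of how that stride follows from the type's size is shown below; the type and function names are illustrative, and `std::uint32_t` is used as a stand-in for the 4-byte `unsigned long` of the Windows sample.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct PackedData { std::uint32_t sum; };              // 4 bytes per element
struct alignas(64) PaddedData { std::uint32_t sum; };  // 64 bytes per element

// Byte distance between consecutive array elements of type T.
template <typename T>
std::size_t element_stride() {
    T pair[2];
    return static_cast<std::size_t>(reinterpret_cast<const char*>(&pair[1]) -
                                    reinterpret_cast<const char*>(&pair[0]));
}

// Number of consecutive elements of type T that fit in one 64-byte cache line.
template <typename T>
constexpr std::size_t elements_per_cache_line() {
    return sizeof(T) >= 64 ? 1 : 64 / sizeof(T);
}
```

With the packed layout, sixteen per-thread sums share one cache line; with the padded layout each sum has a line to itself. Note that allocating an over-aligned type with `new[]` is only guaranteed to honor the alignment from C++17 onward, so the alignment of heap-allocated arrays is worth checking on older language modes.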
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU

KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
PROFILING BEFOREMany refills from
CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTERFew refills from CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
76
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
77
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
78
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi8]rax
add rbp40h
inc rdi
cmp rdir14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi8]rax
add rbp4
inc rdi
cmp rdir14
jb 0000000140001180
ROADMAP
79
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201980
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
bull New Zen2 core microarchitecturebull New 7nm process technologybull Higher instructions-per-cycle (IPC)bull Doubled floating point width to 256-bitbull Doubled loadstore bandwidthbull PCIereg Gen4 Readybull AM4 Desktop Infrastructurebull See httpswwwamdcomeneventscesbull See httpswwwamdcomeneventsnext-horizon
3RD GENERATION AMD RYZENtrade DESKTOP PROCESSOR
QUESTIONS
82
GIVEAWAY
THANK YOU
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
KENNETH MITCHELLISV GAME ENGINEERING RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
85
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201986
The information presented in this document is for informational purposes only and may contain technical inaccuracies omissions and typographical errors
The information contained herein is subject to change and may be rendered inaccurate for many reasons including but not limited to product and roadmap changes component and motherboard version changes new model andor product releases product differences between differing manufacturers software changes BIOS flashes firmware upgrades or the like AMD assumes no obligation to update or otherwise correct or revise this information However AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT INDIRECT SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES
ATTRIBUTIONcopy 2019 Advanced Micro Devices Inc All rights reserved AMD Ryzentrade and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices Inc in the United States andor other jurisdictions Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc Microsoft and Windows are registered trademarks of Microsoft Corporation PCIe is a registered trademark of PCI-SIG Other names are for informational purposes only and may be trademarks of their respective owners
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
USE BEST PRACTICES WITH SPINLOCKS

SUMMARY
• Spin Lock Best Practices
  • Avoid lock prefix instructions
  • Use the pause instruction
  • Test and test-and-set
  • alignas(64) lock variable
    • or __declspec(align(64))
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • ALUTokenStall PTI
    • >= 3K Per Thousand Instructions is bad for top functions

    namespace MyLock {
        typedef unsigned LOCK, *PLOCK;
        enum { LOCK_IS_FREE = 0, LOCK_IS_TAKEN = 1 };

    #if 0 // BAD
        void Lock(PLOCK pl) {
            while (LOCK_IS_TAKEN == _InterlockedCompareExchange(
                       pl, LOCK_IS_TAKEN, LOCK_IS_FREE)) {
                // lock cmpxchg, cmp
            }
        }
    #else // GOOD
        void Lock(PLOCK pl) {
            while ((LOCK_IS_TAKEN == *pl) ||
                   (LOCK_IS_TAKEN == _InterlockedExchange(pl, LOCK_IS_TAKEN))) {
                _mm_pause();
            }
        }
    #endif
        void Unlock(PLOCK pl) {
            _InterlockedExchange(pl, LOCK_IS_FREE);
        }
    }

    alignas(64) MyLock::LOCK gLock;
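
The GOOD variant above pairs a plain read (test) with an atomic exchange (test-and-set) and pauses between attempts. A minimal portable sketch of the same pattern follows, assuming C++11 `std::atomic` in place of the MSVC intrinsics. `SpinLock`, `SPIN_PAUSE`, and `run_spinlock_demo` are illustrative names that do not appear in the slides, and the release store in `unlock` is a substitution for the `_InterlockedExchange` call used there.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()
#else
#define SPIN_PAUSE() std::this_thread::yield()  // fallback on non-x86-64 targets
#endif

// Hypothetical counterpart of MyLock; 64-byte alignment keeps the lock
// on its own cache line.
struct alignas(64) SpinLock {
    std::atomic<unsigned> state{0};  // 0 = free, 1 = taken

    void lock() {
        for (;;) {
            // Test: wait on a plain load so that waiting threads do not
            // issue atomic read-modify-write operations.
            while (state.load(std::memory_order_relaxed) == 1) {
                SPIN_PAUSE();
            }
            // Test-and-set: one exchange attempts to take the lock.
            if (state.exchange(1, std::memory_order_acquire) == 0) {
                return;
            }
        }
    }

    void unlock() { state.store(0, std::memory_order_release); }
};

// Increments a shared counter from several threads under the lock and
// returns the final value.
inline long run_spinlock_demo(int num_threads, int iters_per_thread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&lock, &counter, iters_per_thread] {
            for (int i = 0; i < iters_per_thread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    }
    for (auto& w : workers) w.join();
    return counter;
}
```

If the lock provides mutual exclusion, `run_spinlock_demo(4, 10000)` returns 40000.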

PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Use Best Practices With Spinlocks, elapsed time in seconds by binary (less is better). Before: 304; After: 27; +1018%]

CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <chrono>
    #include <numeric>
    #include <thread>
    #define LEN 512
    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        }
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }

PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions
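
The profiling figures on these slides are normalized per thousand instructions (PTI). As a rough sketch of that normalization, assuming raw event and retired-instruction counts have been exported from a profiler; the function names and the sample counts are hypothetical and are not taken from the slides:

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions (PTI).
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    if (retired_instructions == 0) {
        return 0.0;  // avoid dividing by zero when no samples were taken
    }
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}

// True when a PTI value reaches or exceeds a chosen threshold.
inline bool at_or_above(double pti, double threshold) {
    return pti >= threshold;
}
```

With hypothetical counts of 8,000,000 stall events over 2,000,000 retired instructions, the helper yields 4000 PTI, which is at or above a 3000 PTI threshold.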

SOURCE BEFORE / SOURCE AFTER
[Two image-only slides; no text survives]

DISASSEMBLY SNIPPET

AFTER (GOOD)

    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:

BEFORE (BAD)

    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1

INTERLOCKED INTRINSIC FUNCTIONS

    intrinsic                            generates instructions including
    _InterlockedAnd                      lock cmpxchg
    _interlockedbittestandreset          lock btr
    _interlockedbittestandset            lock bts
    _InterlockedCompareExchange          lock cmpxchg
    _InterlockedCompareExchange128       lock cmpxchg16b
    _InterlockedCompareExchangePointer   lock cmpxchg
    _InterlockedDecrement                lock dec
    _InterlockedExchangeAdd              lock add
    _InterlockedIncrement                lock inc
    _InterlockedOr                       lock cmpxchg
    _InterlockedXor                      lock cmpxchg
    _InterlockedExchange                 xchg
    _InterlockedExchangePointer          xchg

Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.

AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
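
The length condition in the preface can be written as a small predicate, which may help when auditing the copy sizes a program actually uses. This is only a sketch: `in_reported_range`, `copy_fixed`, and `copy_fixed_roundtrip` are hypothetical names, and whether a constant-size copy avoids the runtime library call depends on the compiler and its flags.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True when a copy or set length falls inside the range quoted in the
// preface: greater than 32 bytes and at most 128 bytes.
constexpr bool in_reported_range(std::size_t length) {
    return length > 32 && length <= 128;
}

// Hypothetical helper for call sites whose size is a compile-time constant.
// Compilers commonly expand such copies inline, but that is not guaranteed.
template <std::size_t N>
inline void copy_fixed(void* dst, const void* src) {
    std::memcpy(dst, src, N);
}

// Copies a 48-byte buffer through copy_fixed and reports whether the
// destination matches the source.
inline bool copy_fixed_roundtrip() {
    unsigned char src[48];
    unsigned char dst[48] = {};
    for (std::size_t i = 0; i < sizeof(src); ++i) {
        src[i] = static_cast<unsigned char>(i);
    }
    copy_fixed<sizeof(src)>(dst, src);
    return std::memcmp(dst, src, sizeof(src)) == 0;
}
```

The benchmark's default size of 48 bytes falls inside the range, so `in_reported_range(48)` is true, while 32 and 129 fall outside it.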

WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\” to the application folder

memcpy.asm line 456

    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG  ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]        ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93

        ; Check if strings should be used
        cmp     r8, 128                  ; is this a small set, size <= 128?
        ja      XmmSet                   ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG  ; check if string set should be used
        jnc     XmmSetSmall              ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.

PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: call memcpy benchmark, elapsed time in seconds (axis 0 to 35) versus size (axis 0 to 160), before and after series (less is better)]

CODE SAMPLE

    #include <numeric>
    #include <chrono>
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            memcpy(b, a, size);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide used i <= sizeof(a), which writes one byte past the
        // end of a; the bound here is corrected to i < sizeof(a).
        for (int i = 0; i < sizeof(a); ++i) {
            a[i] = rand() % 256;
        }
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }

DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)

    ; calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …

BEFORE (WITHOUT WORKAROUND)

    ; calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]

AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • WcbFull PTI
    • >= 22 Per Thousand Instructions is bad for top functions
    • >= 35 Per Thousand Instructions is very bad for top functions
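
As an illustration of the "one stream per hardware thread" advice, the sketch below updates two arrays in one loop but issues non-temporal stores to only one of them. It assumes an x86-64 target with SSE; the array names, `kLen`, and `add_delta_one_stream` are hypothetical, and the trailing `_mm_sfence` is an addition that orders the non-temporal stores before later stores.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

constexpr std::size_t kLen = 1024;  // hypothetical length, a multiple of 4
alignas(64) static float pos_a[kLen];
alignas(64) static float pos_b[kLen];

// Adds `delta` to every element of both arrays. Only pos_a is written with
// non-temporal stores, so the loop drives a single write-combining stream.
inline void add_delta_one_stream(float delta) {
    const __m128 d = _mm_set1_ps(delta);
    for (std::size_t i = 0; i < kLen; i += 4) {
        const __m128 va = _mm_add_ps(_mm_load_ps(&pos_a[i]), d);
        const __m128 vb = _mm_add_ps(_mm_load_ps(&pos_b[i]), d);
        _mm_stream_ps(&pos_a[i], va);  // non-temporal (write-combining) store
        _mm_store_ps(&pos_b[i], vb);   // regular cached store
    }
    _mm_sfence();  // order the non-temporal stores before later stores
}
```

Both arrays start zero-initialized, so after one call with `delta = 1.5f` every element of `pos_a` and `pos_b` equals 1.5.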

PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid Too Many Non-Temporal Streams, elapsed time in seconds by binary (less is better). Before: 104; After: 47; +123%]

CODE SAMPLE

    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++) {
            step(dt);
        }
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }

PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE / SOURCE AFTER
[Two image-only slides; no text survives]

DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)

    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)

    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0

AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
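
The "align and pad" advice can be sketched with portable C++ threads: each thread writes only to its own 64-byte-aligned slot, so no two threads modify the same cache line. This is a minimal illustration, not the deck's sample; `PaddedSum`, `kThreads`, and `run_padded_sums` are hypothetical names, and the fixed thread count is an assumption made to keep the slots on the stack.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One accumulator per 64-byte cache line.
struct alignas(64) PaddedSum {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedSum) == 64, "each slot should fill one cache line");

constexpr std::size_t kThreads = 4;  // hypothetical fixed thread count

// Each thread increments only its own slot; returns the total of all slots.
inline unsigned long run_padded_sums(unsigned long iters) {
    PaddedSum slots[kThreads];  // consecutive slots are 64 bytes apart
    std::vector<std::thread> workers;
    for (std::size_t t = 0; t < kThreads; ++t) {
        workers.emplace_back([&slots, t, iters] {
            for (unsigned long i = 0; i < iters; ++i) {
                slots[t].sum += 1;
            }
        });
    }
    for (auto& w : workers) w.join();
    unsigned long total = 0;
    for (const PaddedSum& s : slots) total += s.sum;
    return total;
}
```

`run_padded_sums(100000)` returns 400000. When such slots are allocated on the heap instead, the base alignment of an over-aligned type relies on C++17 aligned `new`.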

PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable

[Chart: Avoid False Sharing, time in seconds by binary. Before: 365; After: 47; +679%]

CODE SAMPLE

    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++) {
            p->sum += rand() % 2;
        }
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                      (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i) {
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        }
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }

PROFILING BEFORE
Many refills from CCX
[Profiler table; only the column names survive: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX
[Profiler table with the same columns]

SOURCE BEFORE / SOURCE AFTER
[Two image-only slides; no text survives]

DISASSEMBLY SNIPPET

AFTER (GOOD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180

BEFORE (BAD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180

ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen 2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon

QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken

DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Spin Lock Best Practices
bull avoid lock prefix instructionsbull use the pause instructionbull test and test-and-setbull alignas(64) lock variable
bull or _declspec(align(64))bull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull ALUTokenStall PTI
bull gt= 3K Per Thousand Instructions is bad for top functions
46
namespace MyLock typedef unsigned LOCK PLOCKenum LOCK_IS_FREE = 0 LOCK_IS_TAKEN = 1
if 0 BAD void Lock(PLOCK pl)
while (LOCK_IS_TAKEN == _InterlockedCompareExchange( pl LOCK_IS_TAKEN LOCK_IS_FREE)) lock xchg cmp
else GOOD
void Lock(PLOCK pl) while ((LOCK_IS_TAKEN == pl) ||
(LOCK_IS_TAKEN==_InterlockedExchange(pl LOCK_IS_TAKEN)))_mm_pause()
endifvoid Unlock(PLOCK pl)
_InterlockedExchange(pl LOCK_IS_FREE)
alignas(64) MyLockLOCK gLock
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 30
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
47
Before AfterTime (s) 304 27
+10180
50
100
150
200
250
300
350
Elap
sed
Tim
e (s
)
binary
Use Best Practices With Spinlocks(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude intrinhinclude stdiohinclude windowshinclude ltchronogtinclude ltnumericgtinclude ltthreadgtdefine LEN 512
alignas(64) float b[LEN][4][4]alignas(64) float c[LEN][4][4]
DWORD WINAPI ThreadProcCallback(LPVOID data) MyLockLock(ampgLock) alignas(64) float a[LEN][4][4]stdfill((float)a (float)(a + LEN) 00f)float r = 00for (size_t iter = 0 iter lt 100000 iter++)
for (int m = 0 m lt LEN m++)for (int i = 0 i lt 4 i++)
for (int j = 0 j lt 4 j++)for (int k = 0 k lt 4 k++)
a[m][i][j] += b[m][i][k] c[m][k][j]r += stdaccumulate((float)a
(float)(a + LEN) 00f) printf(result fn r) MyLockUnlock(ampgLock)return 0
48
int main(int argc char argv[]) using namespace stdchronofloat b0 = (argc gt 1) strtof(argv[1] NULL) 10ffloat c0 = (argc gt 2) strtof(argv[2] NULL) 20f stdfill((float)b (float)(b + LEN) b0)stdfill((float)c (float)(c + LEN) c0)int num_threads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[num_threads]high_resolution_clocktime_point t0 =
high_resolution_clocknow()
for (size_t i = 0 i lt num_threads ++i) threads[i] = CreateThread(NULL
0 ThreadProcCallback NULL 0 NULL) WaitForMultipleObjects(num_threads
threads TRUE INFINITE)
high_resolution_clocktime_point t1 = high_resolution_clocknow()
durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (milliseconds) lfn 10000 time_spancount())
delete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~4K stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~0 stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
51
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
52
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
53
AFTER (GOOD)
00000001400010C0
cmp dword ptr [gLock3IA]1
je 00000001400010D9
mov eax1
xchg eaxdword ptr [gLock3IA]
cmp eax1
jne 00000001400010DD
00000001400010D9
pause
jmp 00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
xor eaxeax
lock cmpxchg dword ptr [gLock3IA]ecx
cmp eaxecx
je 00000001400010C1
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
INTERLOCKED INTRINSIC FUNCTIONSintrinsic generates instructions including_InterlockedAnd lock cmpxchg_interlockedbittestandreset lock btr_interlockedbittestandset lock bts_InterlockedCompareExchange lock cmpxchg_InterlockedCompareExchange128 lock cmpxchg16b_InterlockedCompareExchangePointer lock cmpxchg_InterlockedDecrement lock dec_InterlockedExchangeAdd lock add_InterlockedIncrement lock inc_InterlockedOr lock cmpxchg_InterlockedXor lock cmpxchg_InterlockedExchange xchg_InterlockedExchangePointer xchg
54
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY amp MEMSETREGRESSION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201956
bull Currently ldquoCWindowsSystem32vcruntime140dllrdquo includes a memcpy amp memset regression that can affect AMD family 17h and later processors only
bull Code generated by MSVS 2015 2017 or 2019 may link to vcruntime140dllbull However we have found that this regression occurs under very specific
circumstances where length gt 32 ampamp length lt= 128 Bytes ampamp length is unknown at compile time
bull Meanwhile we propose a workaround
PREFACE
WORKAROUND
• Modify assembly files (copying recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble object files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
ALTERNATE WORKAROUND
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG    ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]          ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                    ; is this a small set, size <= 128?
        ja      XmmSet                     ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG    ; check if string set should be used
        jnc     XmmSetSmall                ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
[Chart: call memcpy benchmark (less is better); x-axis: size, 0 to 160; y-axis: Elapsed Time (s), 0 to 35; series: before, after]
CODE SAMPLE
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, rand, srand, EXIT_SUCCESS
    #include <cstring>  // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Time `steps` calls to memcpy with a length that is unknown at compile time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char *argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // Use i < sizeof(a); i <= sizeof(a) would write one byte past the array.
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    ; calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …
BEFORE (WITHOUT WORKAROUND)
    ; calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
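A minimal sketch of the "one stream per hardware thread" advice: when a loop produces two outputs, only one of them is written with non-temporal stores and the other uses regular cached stores. The function name, the parameters, and the scalar fallback for non-SSE targets are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>
#define SKETCH_HAVE_SSE 1
#else
#define SKETCH_HAVE_SSE 0
#endif

inline bool is_aligned_16(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: multiply `src` by `k` into two 16-byte-aligned outputs.
// Only `streamed` is written with non-temporal stores, so this thread keeps
// a single write-combining stream; `cached` goes through the cache as usual.
inline void scale_two_outputs(const float* src, float* streamed, float* cached,
                              std::size_t n, float k) {
    assert(n % 4 == 0);
    assert(is_aligned_16(src) && is_aligned_16(streamed) && is_aligned_16(cached));
#if SKETCH_HAVE_SSE
    const __m128 kv = _mm_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m128 v = _mm_mul_ps(_mm_load_ps(src + i), kv);
        _mm_stream_ps(streamed + i, v);  // non-temporal store (movntps)
        _mm_store_ps(cached + i, v);     // regular aligned store (movaps)
    }
    _mm_sfence();  // order the streamed writes before later stores
#else
    // Scalar fallback so the sketch still builds on targets without SSE.
    for (std::size_t i = 0; i < n; ++i) {
        const float v = src[i] * k;
        streamed[i] = v;
        cached[i] = v;
    }
#endif
}
```

Which output deserves the non-temporal path depends on the workload; data that will be read again soon is usually the better candidate for the regular store.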
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Time (s): Before 104, After 47 (+123%)
[Chart: Avoid Too Many Non-Temporal Streams (less is better); x-axis: binary; y-axis: Elapsed Time (s)]
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, atof, rand, srand, EXIT_SUCCESS
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char *argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
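For reference, a "per thousand instructions" (PTI) figure is just an event count normalized by retired instructions. The helper below is a hypothetical illustration of that arithmetic for raw counter values; it is not part of AMD uProf, which reports PTI columns directly.

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions (PTI).
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    assert(retired_instructions != 0);
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}
```

Normalizing this way makes functions with very different instruction counts comparable when ranking hot spots.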
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
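The "align and pad" advice can be sketched in portable C++ as follows. Each thread gets its own counter, aligned and padded to 64 bytes so that neighbouring counters never share a cache line. The type and function names are hypothetical, std::thread stands in for the Windows thread API, and C++17 or later is assumed so that std::vector honours alignas(64).

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// One counter per thread, aligned and padded to a 64-byte cache line.
struct alignas(64) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == 64, "each counter should fill one cache line");
static_assert(alignof(PaddedCounter) == 64, "each counter should start on a line boundary");

// Hypothetical driver: every thread adds `iterations` ones to its own counter.
inline std::vector<PaddedCounter> count_in_parallel(unsigned num_threads,
                                                    unsigned long iterations) {
    assert(num_threads > 0);
    std::vector<PaddedCounter> counters(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counters, t, iterations] {
            for (unsigned long i = 0; i < iterations; ++i)
                counters[t].sum += 1;  // touches only this thread's cache line
        });
    }
    for (auto& w : workers)
        w.join();
    return counters;
}
```

Removing alignas(64) packs several counters into one line, which is the layout the summary warns about; the results stay the same but the threads then contend for that line.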
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Time (s): Before 365, After 47 (+679%)
[Chart: Avoid False Sharing; x-axis: binary; y-axis: Time (s)]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, rand, srand, EXIT_SUCCESS
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes: neighbouring elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes: one element per cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char *argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                (void*)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profile columns: same as above]
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
BEFORE (BAD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group, Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 30, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable
Time (s): Before 304, After 27 (+1018%)
[Chart: Use Best Practices With Spinlocks (less is better); x-axis: binary; y-axis: Elapsed Time (s)]
CODE SAMPLE
    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <cstdlib>    // strtof, EXIT_SUCCESS
    #include <numeric>    // std::accumulate
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    // MyLock and gLock are not defined in this listing; they come from
    // elsewhere in the sample.
    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char *argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions
PROFILING AFTER
~0 stalls per 1000 instructions
DISASSEMBLY SNIPPET
AFTER (GOOD)
    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:
BEFORE (BAD)
    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
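The AFTER (GOOD) loop above reads the lock word first, attempts the exchange only when the lock looks free, and executes pause while waiting. A portable sketch of that pattern follows. The class name and layout are hypothetical and this is not the sample's own MyLock, std::atomic stands in for the MSVC intrinsics, and the yield fallback for non-x86 targets is an assumption.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
#define SKETCH_CPU_PAUSE() _mm_pause()
#else
#define SKETCH_CPU_PAUSE() std::this_thread::yield()
#endif

// Hypothetical test-and-test-and-set spinlock, padded to its own cache line.
class alignas(64) SpinLock {
public:
    void lock() {
        for (;;) {
            // Plain load first: waiters spin on a shared copy of the line
            // and only attempt the exchange when the lock looks free.
            if (flag_.load(std::memory_order_relaxed) == 0 &&
                flag_.exchange(1, std::memory_order_acquire) == 0)
                return;
            SKETCH_CPU_PAUSE();
        }
    }

    bool try_lock() {
        return flag_.load(std::memory_order_relaxed) == 0 &&
               flag_.exchange(1, std::memory_order_acquire) == 0;
    }

    void unlock() {
        assert(flag_.load(std::memory_order_relaxed) == 1);
        flag_.store(0, std::memory_order_release);
    }

private:
    std::atomic<int> flag_{0};
};
```

Because the class provides lock(), try_lock() and unlock(), it can be used with std::lock_guard<SpinLock> so the lock is released on every exit path.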
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLE

    #include "intrin.h"
    #include "stdio.h"
    #include "windows.h"
    #include <algorithm>  // std::fill
    #include <chrono>
    #include <cstdlib>    // strtof, EXIT_SUCCESS
    #include <numeric>
    #include <thread>
    #define LEN 512

    alignas(64) float b[LEN][4][4];
    alignas(64) float c[LEN][4][4];

    // MyLock and gLock are defined on the SOURCE BEFORE / SOURCE AFTER slides.
    DWORD WINAPI ThreadProcCallback(LPVOID data) {
        MyLock::Lock(&gLock);
        alignas(64) float a[LEN][4][4];
        std::fill((float*)a, (float*)(a + LEN), 0.0f);
        float r = 0.0;
        for (size_t iter = 0; iter < 100000; iter++) {
            // LEN independent 4x4 matrix multiply-accumulates
            for (int m = 0; m < LEN; m++)
                for (int i = 0; i < 4; i++)
                    for (int j = 0; j < 4; j++)
                        for (int k = 0; k < 4; k++)
                            a[m][i][j] += b[m][i][k] * c[m][k][j];
            r += std::accumulate((float*)a, (float*)(a + LEN), 0.0f);
        }
        printf("result: %f\n", r);
        MyLock::Unlock(&gLock);
        return 0;
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        float b0 = (argc > 1) ? strtof(argv[1], NULL) : 1.0f;
        float c0 = (argc > 2) ? strtof(argv[2], NULL) : 2.0f;
        std::fill((float*)b, (float*)(b + LEN), b0);
        std::fill((float*)c, (float*)(c + LEN), c0);
        int num_threads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[num_threads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < num_threads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, NULL, 0, NULL);
        WaitForMultipleObjects(num_threads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~4K stalls per 1000 instructions

PROFILING AFTER
~0 stalls per 1000 instructions

SOURCE BEFORE
[source listing shown as an image on the slide]

SOURCE AFTER
[source listing shown as an image on the slide]
DISASSEMBLY SNIPPET

AFTER (GOOD)

    00000001400010C0:
        cmp   dword ptr [?gLock@@3IA],1
        je    00000001400010D9
        mov   eax,1
        xchg  eax,dword ptr [?gLock@@3IA]
        cmp   eax,1
        jne   00000001400010DD
    00000001400010D9:
        pause
        jmp   00000001400010C0
    00000001400010DD:

BEFORE (BAD)

    00000001400010C1:
        xor   eax,eax
        lock cmpxchg dword ptr [?gLock@@3IA],ecx
        cmp   eax,ecx
        je    00000001400010C1
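The AFTER (GOOD) loop reads the lock with a plain compare, attempts the exchange only when the lock looks free, and executes pause while it waits. The sketch below shows that test-and-test-and-set pattern in portable C++. It is an illustration under assumptions, not the MyLock class from the slides (whose source is not reproduced in this text): SpinLock, cpu_relax, and locked_count are hypothetical names, and std::atomic stands in for the interlocked intrinsics.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>  // _mm_pause
#define SKETCH_HAS_PAUSE 1
#endif

// Hypothetical helper: emit PAUSE on x86, otherwise yield the thread.
inline void cpu_relax() {
#ifdef SKETCH_HAS_PAUSE
    _mm_pause();
#else
    std::this_thread::yield();
#endif
}

// Hypothetical test-and-test-and-set spin lock (not the MyLock from the slides).
class SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: spin on a plain load so waiting threads issue no locked writes.
            while (state_.load(std::memory_order_relaxed) != 0) cpu_relax();
            // Test-and-set: one atomic exchange, tried only when the lock looked free.
            if (state_.exchange(1, std::memory_order_acquire) == 0) return;
        }
    }
    void unlock() { state_.store(0, std::memory_order_release); }

private:
    std::atomic<unsigned> state_{0};
};

// Runs `threads` workers that each increment a shared counter `iters` times
// while holding the lock, and returns the final counter value.
inline long locked_count(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&lock, &counter, iters] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

Calling locked_count(4, 10000) should return 40000 if the lock provides mutual exclusion. The inner load loop is the part that corresponds to the cmp / je / pause path in the disassembly.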
INTERLOCKED INTRINSIC FUNCTIONS

    intrinsic                             generates instructions including
    _InterlockedAnd                       lock cmpxchg
    _interlockedbittestandreset           lock btr
    _interlockedbittestandset             lock bts
    _InterlockedCompareExchange           lock cmpxchg
    _InterlockedCompareExchange128        lock cmpxchg16b
    _InterlockedCompareExchangePointer    lock cmpxchg
    _InterlockedDecrement                 lock dec
    _InterlockedExchangeAdd               lock add
    _InterlockedIncrement                 lock inc
    _InterlockedOr                        lock cmpxchg
    _InterlockedXor                       lock cmpxchg
    _InterlockedExchange                  xchg
    _InterlockedExchangePointer           xchg

Preferred: instructions without a lock prefix (the xchg rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently “C:\Windows\System32\vcruntime140.dll” includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs only under very specific circumstances: length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
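The size condition in the preface can be written as a small predicate, for example to flag hot call sites while auditing code. This is only a sketch: in_affected_range is a hypothetical name, and because a program cannot tell at run time whether a length was a compile-time constant, the caller supplies that as an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when a copy or set matches the conditions listed
// in the preface (length > 32, length <= 128, length not known at compile time).
constexpr bool in_affected_range(std::size_t length, bool known_at_compile_time) {
    return !known_at_compile_time && length > 32 && length <= 128;
}
```

For instance, in_affected_range(48, false) is true, which matches the default size of 48 used by the talk's memcpy benchmark.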
WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
    • Comment out lines shown in yellow
    • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
        • memcpy.asm
        • memset.asm
• Assemble Object Files
    • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from “Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages” to the application folder

memcpy.asm line 456:

    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:

        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell February 11, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: "call memcpy benchmark" (less is better). x-axis: size, 0 to 160 in steps of 16. y-axis: Elapsed Time (s), 0 to 35. Series: before, after.]
CODE SAMPLE

    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, srand, rand, EXIT_SUCCESS
    #include <cstring>  // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Times `steps` calls to memcpy with a length known only at run time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char* argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // The slide's loop bound was i <= sizeof(a), which writes one byte past the array.
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)

    ; calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …

BEFORE (WITHOUT WORKAROUND)

    ; calls vcruntime140.dll!memcpy
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
    • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
        • WcbFull PTI
            • >= 22 Per Thousand Instructions is bad for top functions
            • >= 35 Per Thousand Instructions is very bad for top functions
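PTI is an event count normalized per thousand retired instructions. The sketch below shows that arithmetic together with the two WcbFull thresholds from the profiling bullets. The function and enum names are hypothetical, and the raw counts are assumed to come from a profiler export.

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions (PTI).
inline double per_thousand_instructions(std::uint64_t events,
                                        std::uint64_t retired_instructions) {
    if (retired_instructions == 0) return 0.0;  // avoid dividing by zero
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}

enum class WcbRating { Ok, Bad, VeryBad };

// Thresholds quoted on the slide for top functions: >= 22 is bad, >= 35 is very bad.
inline WcbRating rate_wcb_full(double pti) {
    if (pti >= 35.0) return WcbRating::VeryBad;
    if (pti >= 22.0) return WcbRating::Bad;
    return WcbRating::Ok;
}
```

With these thresholds, the roughly 23 WcbFull PTI reported before the workaround rates as bad, and the roughly 3 PTI reported after it falls below both limits.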
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell January 31, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

                Before    After
    Time (s)    104       47

+123%

[Chart: "Avoid Too Many Non-Temporal Streams" (less is better). x-axis: binary. y-axis: Elapsed Time (s), 0 to 120.]
CODE SAMPLE

    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, atof, srand, rand, EXIT_SUCCESS
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x, y, z, w, vx, vy, vz, vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0
            // without workaround: a second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else
            // with workaround: regular store, leaving one stream per thread
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions

SOURCE BEFORE
[source listing shown as an image on the slide]

SOURCE AFTER
[source listing shown as an image on the slide]
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)

    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups   xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)

    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps    xmm0,xmm1
    addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
    • L3 is filled from L2 victims of all 4 cores within a CCX.
    • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
    • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
        • Data Cache Refills CCX PTI
            • Minimize
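The bullets on thread-local data and on aligning and padding thread parameters can be sketched together in portable C++. This is an illustration under assumptions: it takes a 64-byte cache line (the alignment used in the talk's own sample), uses std::thread rather than CreateThread, and PaddedSum and parallel_count are hypothetical names. C++17 or later is assumed so that std::vector honors the 64-byte alignment.

```cpp
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// One result slot per thread, aligned and padded to a full 64-byte line so
// that neighbouring slots never share a cache line.
struct alignas(64) PaddedSum {
    std::uint64_t sum = 0;
};
static_assert(sizeof(PaddedSum) == 64, "each slot should occupy one 64-byte line");

// Each worker counts the odd values in [0, iters) and publishes its result once.
inline std::uint64_t parallel_count(unsigned threads, std::uint64_t iters) {
    std::vector<PaddedSum> results(threads);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&results, t, iters] {
            std::uint64_t local = 0;  // thread-local accumulator, not shared
            for (std::uint64_t i = 0; i < iters; ++i) local += i & 1;
            results[t].sum = local;   // single write to the shared array
        });
    for (auto& th : pool) th.join();
    std::uint64_t total = 0;
    for (const auto& r : results) total += r.sum;
    return total;
}
```

Accumulating in a local variable removes most writes to shared memory, and the padding keeps the one remaining write per thread on its own line. Either measure alone addresses the sharing; the sketch combines them only to show both bullets.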
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS: core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

                Before    After
    Time (s)    365       47

+679%

[Chart: "Avoid False Sharing". x-axis: binary. y-axis: Time (s), 0 to 400.]
CODE SAMPLE

    #include <windows.h>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, srand, rand, EXIT_SUCCESS
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0
    // 4 bytes: neighbouring array elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else
    // 64 bytes: each array element occupies its own cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Profiler table columns: CPU clocks; Ret inst; IPC; Data Cache Accesses PTI; Demand Data Cache Miss; Data Cache Miss; Data Cache Refills DRAM PTI; Data Cache Refills CCX PTI; Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX
[Profiler table columns: CPU clocks; Ret inst; IPC; Data Cache Accesses PTI; Demand Data Cache Miss; Data Cache Miss; Data Cache Refills DRAM PTI; Data Cache Refills CCX PTI; Data Cache Refills L2 PTI]

SOURCE BEFORE
[source listing shown as an image on the slide]

SOURCE AFTER
[source listing shown as an image on the slide]
DISASSEMBLY SNIPPET

AFTER (GOOD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180

BEFORE (BAD)

    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~4K stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~0 stalls per 1000
instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
51
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
52
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
53
AFTER (GOOD)
00000001400010C0
cmp dword ptr [gLock3IA]1
je 00000001400010D9
mov eax1
xchg eaxdword ptr [gLock3IA]
cmp eax1
jne 00000001400010DD
00000001400010D9
pause
jmp 00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
xor eaxeax
lock cmpxchg dword ptr [gLock3IA]ecx
cmp eaxecx
je 00000001400010C1
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
INTERLOCKED INTRINSIC FUNCTIONSintrinsic generates instructions including_InterlockedAnd lock cmpxchg_interlockedbittestandreset lock btr_interlockedbittestandset lock bts_InterlockedCompareExchange lock cmpxchg_InterlockedCompareExchange128 lock cmpxchg16b_InterlockedCompareExchangePointer lock cmpxchg_InterlockedDecrement lock dec_InterlockedExchangeAdd lock add_InterlockedIncrement lock inc_InterlockedOr lock cmpxchg_InterlockedXor lock cmpxchg_InterlockedExchange xchg_InterlockedExchangePointer xchg
54
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY amp MEMSETREGRESSION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201956
bull Currently ldquoCWindowsSystem32vcruntime140dllrdquo includes a memcpy amp memset regression that can affect AMD family 17h and later processors only
bull Code generated by MSVS 2015 2017 or 2019 may link to vcruntime140dllbull However we have found that this regression occurs under very specific
circumstances where length gt 32 ampamp length lt= 128 Bytes ampamp length is unknown at compile time
bull Meanwhile we propose a workaround
PREFACE
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201957
Workaroundbull Modify Assembly Files (Copying Recommended)
bull Comment out lines shown in yellowbull CProgram Files (x86)Microsoft Visual
Studio2017CommunityVCToolsMSVC141627023crtsrcx64
bull memcpyasmbull memsetasm
bull Assemble Object Filesbull ml64exe -c memcpyasm memsetasm
bull Set Linker gt Input gt Additional Dependencies = memcpyobjmemsetobjhellip
Alternate Workaround
bull Copy vcruntime140dll from ldquoMicrosoft Visual Studio2019PreviewCommon7IDEVCvcpackagesrdquo to the application folder
WORKAROUND memcpyasm line 456XmmCopySmall bt __favor __FAVOR_SMSTRG check if string copy should be used jc memcpy_repmovsmovups xmm0 [rcx + rdx] load deferred bytesadd rcx 16sub r8 16
memsetasm line 93 Check if strings should be used cmp r8 128 is this a small set size lt= 128 ja XmmSet if large set use XMM set bt __favor __FAVOR_SMSTRG check if string set should be used jnc XmmSetSmall otherwise use a 16-byte block set jmp memset_repmovs memcpyasm amp memsetasm code is Copyright (c) Microsoft
Corporation All rights reserved
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying workaround shows
higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell February 11
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
58
0
5
10
15
20
25
30
35
0 16 32 48 64 80 96 112 128 144 160
Elap
sed
Tim
e (s
)
size
call memcpy benchmark(less is better)
before after
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltnumericgtinclude ltchronogtalignas(64) char a[32 1024]alignas(64) char b[32 1024]using namespace stdchrono
void work(int size size_t steps) high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
memcpy(b a size)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())
59
int main(int argc char argv[]) int j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 7size_t steps = (argc gt 3) atoll(argv[3]) 1000000000int size = (argc gt 4) atoi(argv[4]) 48srand(seed)for (int i = 0 i lt= sizeof(a) ++i )
a[i] = rand()256memset(b 0 sizeof(b))
work(size steps)
printf(a[i] = in j a[j])printf(b[i] = in j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
60
AFTER (WITH WORKAROUND) calls memcpy within executable with workaroundmemcpymov r11rcxmov r10rdxcmp r810hjbe 00000001400012F0cmp r820hjbe 00000001400012D0sub rdxrcxjae 00000001400012A4lea rax[r8+r10]cmp rcxraxjb 00000001400015D0cmp r880hjbe 0000000140001510hellip
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140dllmemcpy
memcpy
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Write Combining lines are not stored in the caches The Write Combining Buffer writes 64 byte lines to
memory when full The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
bull Avoid interleaving multiple Write Combining streams to different addresses use only one stream per hardware thread if possible While using multiple streams the hardware may close buffers before they are completely full thus leading to reduced performance
bull Profilingbull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profile
bull WcbFull PTI bull gt= 22 Per Thousand Instructions is bad for top functionsbull gt= 35 Per Thousand Instructions is very bad for top functions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying the workaround
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 31
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
63
Before AfterTime (s) 104 47
+123
0
20
40
60
80
100
120
Elap
sed
Tim
e (s
)
binary
Avoid Too Many Non-Temporal Streams(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltintrinhgtinclude ltnumericgtinclude ltchronogtdefine LEN 64000alignas(64) float a[LEN]alignas(64) float b[LEN]
void step(float dt) for (size_t i = 0 i lt LEN i += 8)
xyzwvxvyvyvw__m128 p1 = _mm_load_ps(ampa[(i + 0) LEN])__m128 v1 = _mm_load_ps(ampa[(i + 4) LEN])p1 = _mm_add_ps(p1 _mm_mul_ps(v1 _mm_load_ps1(ampdt)))_mm_stream_ps(ampa[(i + 0) LEN] p1)__m128 p2 = _mm_load_ps(ampb[(i + 0) LEN])__m128 v2 = _mm_load_ps(ampb[(i + 4) LEN])p2 = _mm_add_ps(p2 _mm_mul_ps(v2 _mm_load_ps1(ampdt)))
if 0 without workaround _mm_stream_ps(ampb[(i + 0) LEN] p2)
else with workaround _mm_store_ps(ampb[(i + 0) LEN] p2)
endif
64
int main(int argc char argv[]) using namespace stdchronoint j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 3size_t steps = (argc gt 3) atoll(argv[3]) 2000000float dt = (argc gt 4) (float)atof(argv[4]) 0001fsrand(seed)for (int i = 0 i lt LEN ++i)
a[i] = (float)rand() RAND_MAXb[i] = (float)rand() RAND_MAX
high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
step(dt)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())printf(a[i] = fn j a[j])printf(b[i] = fn j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~23 WcbFull per 1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movups xmmword ptr [rsi+r84+4640h]xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movntps xmmword ptr [rsi+r84+4640h]xmm0
AVOID FALSE SHARING
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull False Sharing occurs when threads running on different processors each with a local cache modify
variables that exist in the same cache line This can reduce performance due to processor work required to maintain cache-coherencybull L3 is filled from L2 victims of all 4 cores within a CCX
bull L2 tags are duplicated in the L3 for fast cache transfers within a CCXbull And fast probe filtering for RyzentradeThreadrippertrade and Epyctrade
bull Use thread local rather than global or process shared databull Align and pad Thread parameters ndash especially synchronizations variablesbull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull Data Cache Refills CCX PTI
bull Minimize
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1593bull Testing done by Kenneth Mitchell February 5
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
72
Before AfterTime (s) 365 47
+6790
50
100
150
200
250
300
350
400
Tim
e (s
)
binary
Avoid False Sharing
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltwindowshgtinclude ltchronogtinclude ltnumericgtinclude ltthreadgtusing namespace stdchronodefine NUM_ITER 2000000000int seed
if 0 4 bytes struct ThreadData unsigned long sum else 64 bytes struct alignas(64) ThreadData unsigned long sum endifDWORD WINAPI ThreadProcCallback(void param) ThreadData p = (ThreadData)paramsrand(seed)p-gtsum = 0for (int i = 0 i lt NUM_ITER i++)
p-gtsum += rand() 2return 0
73
int main(int argc char argv[]) seed = (argc gt 1) atoi(argv[1]) 3int numThreads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[numThreads]ThreadData a = new ThreadData[numThreads]high_resolution_clocktime_point t0 = high_resolution_clocknow()for (size_t i = 0 i lt numThreads ++i) threads[i] = CreateThread(NULL 0 ThreadProcCallback
(void)ampa[i] 0 NULL)WaitForMultipleObjects(numThreads threads TRUE INFINITE)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (ms) lfn 10000 time_spancount())for (size_t i = 0 i lt numThreads ++i)
printf(sum[llu] = lun i a[i]sum)delete[] adelete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFOREMany refills from
CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTERFew refills from CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
  mov   qword ptr [rsp+28h],rsi
  lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9,rbp
  mov   dword ptr [rsp+20h],esi
  xor   edx,edx
  xor   ecx,ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r15+rdi*8],rax
  add   rbp,40h
  inc   rdi
  cmp   rdi,r14
  jb    0000000140001180
BEFORE (BAD)
0000000140001180
  mov   qword ptr [rsp+28h],rsi
  lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9,rbp
  mov   dword ptr [rsp+20h],esi
  xor   edx,edx
  xor   ecx,ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r12+rdi*8],rax
  add   rbp,4
  inc   rdi
  cmp   rdi,r14
  jb    0000000140001180
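The two listings differ only in the per-thread pointer stride (add rbp,40h versus add rbp,4), which is sizeof(ThreadData). The following is a minimal sketch of that relationship, not the presenter's code. It assumes a 64-byte cache line and uses std::uint32_t in place of the sample's unsigned long (4 bytes with MSVC on Windows, 8 bytes on most 64-bit Linux toolchains). Packed, Padded, stride_of and elements_per_line are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical stand-ins for the two ThreadData variants in the code sample.
struct Packed { std::uint32_t sum; };              // 4-byte stride
struct alignas(64) Padded { std::uint32_t sum; };  // 64-byte stride

// Byte distance between consecutive array elements. This is the constant
// the thread-creation loop adds to the element pointer on each iteration.
template <typename T>
constexpr std::size_t stride_of() { return sizeof(T); }

// How many elements of type T fit in one cache line of `line` bytes
// (64 is an assumption about the target CPU, not queried at run time).
template <typename T>
constexpr std::size_t elements_per_line(std::size_t line = 64) {
    return sizeof(T) >= line ? 1 : line / sizeof(T);
}

static_assert(stride_of<Packed>() == 4, "corresponds to add rbp,4");
static_assert(stride_of<Padded>() == 0x40, "corresponds to add rbp,40h");
```

With the packed layout, up to 16 per-thread counters land in one line and every write invalidates that line in the other cores' caches. Note that allocating an over-aligned type with new[] only guarantees the requested alignment from C++17 onward.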
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com @kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
PROFILING AFTER
~0 stalls per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
00000001400010C0
  cmp   dword ptr [?gLock@@3IA],1
  je    00000001400010D9
  mov   eax,1
  xchg  eax,dword ptr [?gLock@@3IA]
  cmp   eax,1
  jne   00000001400010DD
00000001400010D9
  pause
  jmp   00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
  xor   eax,eax
  lock cmpxchg dword ptr [?gLock@@3IA],ecx
  cmp   eax,ecx
  je    00000001400010C1
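The AFTER listing reads the lock with a plain load first, attempts the exchange only when the lock looks free, and executes pause between attempts. The following is a minimal portable sketch of that test-and-test-and-set pattern using std::atomic rather than the MSVC intrinsics from the talk. SpinLock, SPIN_PAUSE and contended_sum are hypothetical names, and the memory orderings are one reasonable choice, not something the slides specify.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#define SPIN_PAUSE() _mm_pause()               // emits the pause instruction
#else
#define SPIN_PAUSE() std::this_thread::yield() // fallback on other targets
#endif

// Hypothetical lock: 1 = held, 0 = free, as in the listing.
class SpinLock {
public:
    void lock() {
        for (;;) {
            // Test: a read-only check that does not take the line exclusively.
            // Test-and-set: exchange only if the lock looked free.
            if (state_.load(std::memory_order_relaxed) != 1 &&
                state_.exchange(1, std::memory_order_acquire) != 1)
                return;
            SPIN_PAUSE();  // back off before the next attempt
        }
    }
    void unlock() { state_.store(0, std::memory_order_release); }
private:
    std::atomic<unsigned> state_{0};
};

// Demonstration helper: `threads` workers each increment a shared
// counter `iters` times under the lock and the final total is returned.
inline unsigned long long contended_sum(unsigned threads, unsigned iters) {
    SpinLock lock;
    unsigned long long total = 0;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (unsigned i = 0; i < iters; ++i) {
                lock.lock();
                ++total;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return total;
}
```

The BEFORE listing instead retries lock cmpxchg in a tight loop, so every waiting thread keeps requesting the cache line for writing even while the lock is held.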
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently, "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 Bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
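The condition above has two parts: the length falls in a particular window, and the compiler cannot see the length. The following is a minimal sketch of both parts, not the benchmark used in the talk. in_reported_window, time_copies_seconds, src_buf and dst_buf are hypothetical names, and the timing loop is only an illustration; a careful benchmark needs more safeguards against compiler optimization.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper: true when a length falls inside the window the
// slide reports (length > 32 && length <= 128 bytes).
constexpr bool in_reported_window(std::size_t length) {
    return length > 32 && length <= 128;
}

alignas(64) static char src_buf[256];
alignas(64) static char dst_buf[256];

// Times `steps` copies of `length` bytes (clamped to the buffer size).
// The length is read through a volatile on every iteration so the
// compiler cannot specialize the copy for a size known at compile time.
inline double time_copies_seconds(std::size_t length, std::size_t steps) {
    if (length > sizeof(src_buf)) length = sizeof(src_buf);
    volatile std::size_t runtime_length = length;
    const auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < steps; ++i)
        std::memcpy(dst_buf, src_buf, runtime_length);
    const auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Sweeping the length from 16 to 160 in steps of 16 and comparing time_copies_seconds across sizes is one way to check whether a given runtime shows a slowdown confined to that window.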
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages\" to the application folder

memcpy.asm line 456
XmmCopySmall:
    bt      __favor, __FAVOR_SMSTRG  ; check if string copy should be used
    jc      memcpy_repmovs
    movups  xmm0, [rcx + rdx]        ; load deferred bytes
    add     rcx, 16
    sub     r8, 16

memset.asm line 93
    ; Check if strings should be used
    cmp     r8, 128                  ; is this a small set, size <= 128?
    ja      XmmSet                   ; if large set, use XMM set
    bt      __favor, __FAVOR_SMSTRG  ; check if string set should be used
    jnc     XmmSetSmall              ; otherwise, use a 16-byte block set
    jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: call memcpy benchmark (less is better). X axis: size, 0 to 160. Y axis: Elapsed Time (s), 0 to 35. Series: before, after.
CODE SAMPLE
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, rand, srand, EXIT_SUCCESS
#include <cstring>  // memcpy, memset
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);   // size is only known at run time
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    // The slide's loop condition was i <= sizeof(a), which writes one byte
    // past the end of a; i < sizeof(a) stays in bounds.
    for (int i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
  mov   r11,rcx
  mov   r10,rdx
  cmp   r8,10h
  jbe   00000001400012F0
  cmp   r8,20h
  jbe   00000001400012D0
  sub   rdx,rcx
  jae   00000001400012A4
  lea   rax,[r8+r10]
  cmp   rcx,rax
  jb    00000001400015D0
  cmp   r8,80h
  jbe   0000000140001510
  …
BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll memcpy
memcpy:
  jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
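The "one stream per hardware thread" advice can be sketched as a loop that writes two outputs but uses a non-temporal store for only one of them. This is an illustration under stated assumptions, not the talk's sample: it targets x86-64 with SSE, requires 16-byte-aligned pointers and a length that is a multiple of 4, and scale_one_nt_stream is a hypothetical name.

```cpp
#include <cstddef>
#include <immintrin.h>

// Hypothetical kernel: scales two input arrays by k into two outputs.
// Only out_stream is written with non-temporal stores; out_cached uses
// regular stores, so the loop keeps a single write-combining stream open.
// Assumes all pointers are 16-byte aligned and n is a multiple of 4.
inline void scale_one_nt_stream(float* out_stream, float* out_cached,
                                const float* in_a, const float* in_b,
                                float k, std::size_t n) {
    const __m128 vk = _mm_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 4) {
        // Non-temporal store: bypasses the caches via write combining.
        _mm_stream_ps(out_stream + i, _mm_mul_ps(_mm_load_ps(in_a + i), vk));
        // Regular store: goes through the cache hierarchy.
        _mm_store_ps(out_cached + i, _mm_mul_ps(_mm_load_ps(in_b + i), vk));
    }
    _mm_sfence();  // order the non-temporal stores before later stores
}
```

Which output should stay non-temporal depends on the workload; data that is read again soon is usually the better candidate for the regular store.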
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: Avoid Too Many Non-Temporal Streams (less is better). X axis: binary. Y axis: Elapsed Time (s), 0 to 120.
            Before   After
Time (s)    104      47      (+123%)
CODE SAMPLE
#include <intrin.h>
#include <numeric>
#include <chrono>
#include <cstdio>   // printf
#include <cstdlib>  // atoi, atoll, atof, rand, srand, EXIT_SUCCESS
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // layout per element: x, y, z, w, vx, vy, vz, vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);   // first non-temporal stream
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
  movups   xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
  movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad Thread parameters – especially synchronization variables
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
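The "use thread local rather than shared data" bullet can be sketched as follows: each worker accumulates into a local variable and touches shared memory once, at the end. This is an illustrative alternative to padding, not code from the talk. sum_per_thread is a hypothetical name, and the (i ^ t) & 1 expression is a deterministic stand-in for a per-iteration random bit.

```cpp
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical worker pattern: every thread keeps its running sum in a
// thread-private local and writes its result slot exactly once.
inline std::vector<std::uint64_t> sum_per_thread(unsigned threads,
                                                 std::uint64_t iters) {
    std::vector<std::uint64_t> results(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back([&results, t, iters] {
            std::uint64_t local = 0;               // private accumulator
            for (std::uint64_t i = 0; i < iters; ++i)
                local += (i ^ t) & 1u;             // deterministic 0/1 value
            results[t] = local;                    // single shared write
        });
    for (auto& th : pool) th.join();
    return results;
}
```

The result slots still share cache lines here, but each is written once per thread instead of once per iteration, so the coherency traffic is negligible. Padding remains the better fit when the shared location has to be updated repeatedly, as with synchronization variables.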
PERFORMANCE
• Binary compiled after applying best practices shows higher performance
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution. BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: Avoid False Sharing. X axis: binary. Y axis: Time (s), 0 to 400.
            Before   After
Time (s)    365      47      (+679%)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
51
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
52
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
53
AFTER (GOOD)
00000001400010C0
cmp dword ptr [gLock3IA]1
je 00000001400010D9
mov eax1
xchg eaxdword ptr [gLock3IA]
cmp eax1
jne 00000001400010DD
00000001400010D9
pause
jmp 00000001400010C0
00000001400010DD
BEFORE (BAD)
00000001400010C1
xor eaxeax
lock cmpxchg dword ptr [gLock3IA]ecx
cmp eaxecx
je 00000001400010C1
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
INTERLOCKED INTRINSIC FUNCTIONSintrinsic generates instructions including_InterlockedAnd lock cmpxchg_interlockedbittestandreset lock btr_interlockedbittestandset lock bts_InterlockedCompareExchange lock cmpxchg_InterlockedCompareExchange128 lock cmpxchg16b_InterlockedCompareExchangePointer lock cmpxchg_InterlockedDecrement lock dec_InterlockedExchangeAdd lock add_InterlockedIncrement lock inc_InterlockedOr lock cmpxchg_InterlockedXor lock cmpxchg_InterlockedExchange xchg_InterlockedExchangePointer xchg
54
Preferred instructions without lock prefix
Instructions generated may vary depending on compiler and optimization flags
AVOID MEMCPY amp MEMSETREGRESSION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201956
bull Currently ldquoCWindowsSystem32vcruntime140dllrdquo includes a memcpy amp memset regression that can affect AMD family 17h and later processors only
bull Code generated by MSVS 2015 2017 or 2019 may link to vcruntime140dllbull However we have found that this regression occurs under very specific
circumstances where length gt 32 ampamp length lt= 128 Bytes ampamp length is unknown at compile time
bull Meanwhile we propose a workaround
PREFACE
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201957
Workaroundbull Modify Assembly Files (Copying Recommended)
bull Comment out lines shown in yellowbull CProgram Files (x86)Microsoft Visual
Studio2017CommunityVCToolsMSVC141627023crtsrcx64
bull memcpyasmbull memsetasm
bull Assemble Object Filesbull ml64exe -c memcpyasm memsetasm
bull Set Linker gt Input gt Additional Dependencies = memcpyobjmemsetobjhellip
Alternate Workaround
bull Copy vcruntime140dll from ldquoMicrosoft Visual Studio2019PreviewCommon7IDEVCvcpackagesrdquo to the application folder
WORKAROUND memcpyasm line 456XmmCopySmall bt __favor __FAVOR_SMSTRG check if string copy should be used jc memcpy_repmovsmovups xmm0 [rcx + rdx] load deferred bytesadd rcx 16sub r8 16
memsetasm line 93 Check if strings should be used cmp r8 128 is this a small set size lt= 128 ja XmmSet if large set use XMM set bt __favor __FAVOR_SMSTRG check if string set should be used jnc XmmSetSmall otherwise use a 16-byte block set jmp memset_repmovs memcpyasm amp memsetasm code is Copyright (c) Microsoft
Corporation All rights reserved
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying workaround shows
higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell February 11
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
58
0
5
10
15
20
25
30
35
0 16 32 48 64 80 96 112 128 144 160
Elap
sed
Tim
e (s
)
size
call memcpy benchmark(less is better)
before after
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltnumericgtinclude ltchronogtalignas(64) char a[32 1024]alignas(64) char b[32 1024]using namespace stdchrono
void work(int size size_t steps) high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
memcpy(b a size)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())
59
int main(int argc char argv[]) int j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 7size_t steps = (argc gt 3) atoll(argv[3]) 1000000000int size = (argc gt 4) atoi(argv[4]) 48srand(seed)for (int i = 0 i lt= sizeof(a) ++i )
a[i] = rand()256memset(b 0 sizeof(b))
work(size steps)
printf(a[i] = in j a[j])printf(b[i] = in j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
60
AFTER (WITH WORKAROUND) calls memcpy within executable with workaroundmemcpymov r11rcxmov r10rdxcmp r810hjbe 00000001400012F0cmp r820hjbe 00000001400012D0sub rdxrcxjae 00000001400012A4lea rax[r8+r10]cmp rcxraxjb 00000001400015D0cmp r880hjbe 0000000140001510hellip
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140dllmemcpy
memcpy
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Write Combining lines are not stored in the caches The Write Combining Buffer writes 64 byte lines to
memory when full The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
bull Avoid interleaving multiple Write Combining streams to different addresses use only one stream per hardware thread if possible While using multiple streams the hardware may close buffers before they are completely full thus leading to reduced performance
bull Profilingbull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profile
bull WcbFull PTI bull gt= 22 Per Thousand Instructions is bad for top functionsbull gt= 35 Per Thousand Instructions is very bad for top functions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying the workaround
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 31
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
63
Before AfterTime (s) 104 47
+123
0
20
40
60
80
100
120
Elap
sed
Tim
e (s
)
binary
Avoid Too Many Non-Temporal Streams(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltintrinhgtinclude ltnumericgtinclude ltchronogtdefine LEN 64000alignas(64) float a[LEN]alignas(64) float b[LEN]
void step(float dt) for (size_t i = 0 i lt LEN i += 8)
xyzwvxvyvyvw__m128 p1 = _mm_load_ps(ampa[(i + 0) LEN])__m128 v1 = _mm_load_ps(ampa[(i + 4) LEN])p1 = _mm_add_ps(p1 _mm_mul_ps(v1 _mm_load_ps1(ampdt)))_mm_stream_ps(ampa[(i + 0) LEN] p1)__m128 p2 = _mm_load_ps(ampb[(i + 0) LEN])__m128 v2 = _mm_load_ps(ampb[(i + 4) LEN])p2 = _mm_add_ps(p2 _mm_mul_ps(v2 _mm_load_ps1(ampdt)))
if 0 without workaround _mm_stream_ps(ampb[(i + 0) LEN] p2)
else with workaround _mm_store_ps(ampb[(i + 0) LEN] p2)
endif
64
int main(int argc char argv[]) using namespace stdchronoint j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 3size_t steps = (argc gt 3) atoll(argv[3]) 2000000float dt = (argc gt 4) (float)atof(argv[4]) 0001fsrand(seed)for (int i = 0 i lt LEN ++i)
a[i] = (float)rand() RAND_MAXb[i] = (float)rand() RAND_MAX
high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
step(dt)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())printf(a[i] = fn j a[j])printf(b[i] = fn j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~23 WcbFull per 1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movups xmmword ptr [rsi+r84+4640h]xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movntps xmmword ptr [rsi+r84+4640h]xmm0
AVOID FALSE SHARING
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull False Sharing occurs when threads running on different processors each with a local cache modify
variables that exist in the same cache line This can reduce performance due to processor work required to maintain cache-coherencybull L3 is filled from L2 victims of all 4 cores within a CCX
bull L2 tags are duplicated in the L3 for fast cache transfers within a CCXbull And fast probe filtering for RyzentradeThreadrippertrade and Epyctrade
bull Use thread local rather than global or process shared databull Align and pad Thread parameters ndash especially synchronizations variablesbull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull Data Cache Refills CCX PTI
bull Minimize
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1593bull Testing done by Kenneth Mitchell February 5
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
72
Before AfterTime (s) 365 47
+6790
50
100
150
200
250
300
350
400
Tim
e (s
)
binary
Avoid False Sharing
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltwindowshgtinclude ltchronogtinclude ltnumericgtinclude ltthreadgtusing namespace stdchronodefine NUM_ITER 2000000000int seed
if 0 4 bytes struct ThreadData unsigned long sum else 64 bytes struct alignas(64) ThreadData unsigned long sum endifDWORD WINAPI ThreadProcCallback(void param) ThreadData p = (ThreadData)paramsrand(seed)p-gtsum = 0for (int i = 0 i lt NUM_ITER i++)
p-gtsum += rand() 2return 0
73
int main(int argc char argv[]) seed = (argc gt 1) atoi(argv[1]) 3int numThreads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[numThreads]ThreadData a = new ThreadData[numThreads]high_resolution_clocktime_point t0 = high_resolution_clocknow()for (size_t i = 0 i lt numThreads ++i) threads[i] = CreateThread(NULL 0 ThreadProcCallback
(void)ampa[i] 0 NULL)WaitForMultipleObjects(numThreads threads TRUE INFINITE)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (ms) lfn 10000 time_spancount())for (size_t i = 0 i lt numThreads ++i)
printf(sum[llu] = lun i a[i]sum)delete[] adelete[] threadsreturn EXIT_SUCCESS
PROFILING BEFORE
Many refills from CCX.
[Screenshot columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX.
[Screenshot columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET

AFTER (GOOD)
0000000140001180:
  mov   qword ptr [rsp+28h],rsi
  lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9,rbp
  mov   dword ptr [rsp+20h],esi
  xor   edx,edx
  xor   ecx,ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r15+rdi*8],rax
  add   rbp,40h
  inc   rdi
  cmp   rdi,r14
  jb    0000000140001180

BEFORE (BAD)
0000000140001180:
  mov   qword ptr [rsp+28h],rsi
  lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
  mov   r9,rbp
  mov   dword ptr [rsp+20h],esi
  xor   edx,edx
  xor   ecx,ecx
  call  qword ptr [__imp_CreateThread]
  mov   qword ptr [r12+rdi*8],rax
  add   rbp,4
  inc   rdi
  cmp   rdi,r14
  jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com  @kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
SOURCE AFTER
DISASSEMBLY SNIPPET

AFTER (GOOD)
00000001400010C0:
  cmp   dword ptr [?gLock@@3IA],1
  je    00000001400010D9
  mov   eax,1
  xchg  eax,dword ptr [?gLock@@3IA]
  cmp   eax,1
  jne   00000001400010DD
00000001400010D9:
  pause
  jmp   00000001400010C0
00000001400010DD:

BEFORE (BAD)
00000001400010C1:
  xor   eax,eax
  lock cmpxchg dword ptr [?gLock@@3IA],ecx
  cmp   eax,ecx
  je    00000001400010C1
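The AFTER loop reads the lock with a plain compare, executes pause while the lock is held, and only attempts the atomic xchg once the lock looks free. A minimal portable sketch of that pattern follows, assuming std::atomic in place of the MSVC interlocked intrinsics the slides use. The names SpinLock, cpu_relax, and count_with_lock are hypothetical and exist only for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>
#if defined(__x86_64__) || defined(_M_X64) || defined(__i386__) || defined(_M_IX86)
#include <immintrin.h>
static inline void cpu_relax() { _mm_pause(); }               // emits the pause instruction
#else
static inline void cpu_relax() { std::this_thread::yield(); } // fallback on non-x86 targets
#endif

// Hypothetical test-and-test-and-set spinlock (not the code from the slides).
class SpinLock {
    std::atomic<unsigned> state{0};   // 0 = free, 1 = held
public:
    void lock() {
        for (;;) {
            // Test: wait on a plain load so waiting threads only read the line.
            while (state.load(std::memory_order_relaxed) == 1)
                cpu_relax();
            // Test-and-set: a single atomic exchange once the lock looks free.
            if (state.exchange(1, std::memory_order_acquire) == 0)
                return;
        }
    }
    void unlock() { state.store(0, std::memory_order_release); }
};

// Increments one shared counter from several threads under the lock
// and returns the final value.
long count_with_lock(int numThreads, int itersPerThread) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < numThreads; ++t)
        threads.emplace_back([&] {
            for (int i = 0; i < itersPerThread; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : threads) th.join();
    return counter;
}
```

Calling count_with_lock(4, 10000) should return 40000 if the lock provides mutual exclusion. Whether a compiler turns the exchange into the same xchg sequence shown above depends on the compiler and flags.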
INTERLOCKED INTRINSIC FUNCTIONS
intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg
Preferred: instructions without lock prefix (the xchg rows).
Instructions generated may vary depending on compiler and optimization flags.
AVOID MEMCPY & MEMSET REGRESSION
PREFACE
• Currently "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances, where length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
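One way to check whether a particular build is affected is to time memcpy for lengths on either side of the reported window while keeping the length unknown at compile time. The sketch below is only an illustration of that idea. The names in_reported_window and time_memcpy, the two static buffers, and the use of a volatile variable to hide the length from the compiler are all assumptions, not part of the original slides.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

alignas(64) static char src_buf[32 * 1024];
alignas(64) static char dst_buf[32 * 1024];

// True when a length falls in the range the preface describes:
// greater than 32 bytes and at most 128 bytes.
bool in_reported_window(std::size_t length) {
    return length > 32 && length <= 128;
}

// Copies `length` bytes from src_buf to dst_buf `steps` times and
// returns the elapsed time in seconds.
double time_memcpy(std::size_t length, std::size_t steps) {
    using clock = std::chrono::steady_clock;
    if (length > sizeof(src_buf))
        length = sizeof(src_buf);            // stay inside the buffers
    volatile std::size_t hidden = length;    // re-read at run time on each iteration
    const auto t0 = clock::now();
    for (std::size_t i = 0; i < steps; ++i)
        std::memcpy(dst_buf, src_buf, hidden);
    const auto t1 = clock::now();
    return std::chrono::duration<double>(t1 - t0).count();
}
```

Running time_memcpy with the same step count for lengths such as 16, 48, 96, and 160 gives a rough comparison across the window. Absolute numbers depend on the runtime library, compiler flags, and processor.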
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456:
XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG     ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]           ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                     ; is this a small set, size <= 128?
        ja      XmmSet                      ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG     ; check if string set should be used
        jnc     XmmSetSmall                 ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). Elapsed Time (s), axis 0 to 35, versus size, axis 0 to 160. Series: before, after.]
CODE SAMPLE
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <numeric>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

// Times `steps` calls to memcpy with a length that is only known at run time.
void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char* argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;
    srand(seed);
    for (int i = 0; i < sizeof(a); ++i)   // fill every byte of a, staying in bounds
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
  mov   r11,rcx
  mov   r10,rdx
  cmp   r8,10h
  jbe   00000001400012F0
  cmp   r8,20h
  jbe   00000001400012D0
  sub   rdx,rcx
  jae   00000001400012A4
  lea   rax,[r8+r10]
  cmp   rcx,rax
  jb    00000001400015D0
  cmp   r8,80h
  jbe   0000000140001510
  …

BEFORE (WITHOUT WORKAROUND)
; calls vcruntime140.dll memcpy
memcpy:
  jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64 byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
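The "one stream per hardware thread" guidance can be sketched as a loop that writes two output arrays but sends only one of them through non-temporal stores. This is a minimal illustration, assuming an x86 target with SSE; the function name scale_two_arrays and its parameters are hypothetical and do not come from the slides.

```cpp
#include <cstddef>
#include <xmmintrin.h>   // SSE: _mm_load_ps, _mm_mul_ps, _mm_stream_ps, _mm_store_ps, _mm_sfence

// Hypothetical helper: multiplies two input arrays by k.
// Only out_stream is written with non-temporal stores, so the loop keeps a
// single write-combining stream; out_cached uses ordinary cached stores.
// All four pointers must be 16-byte aligned and n must be a multiple of 4.
void scale_two_arrays(float* out_stream, float* out_cached,
                      const float* in_a, const float* in_b,
                      std::size_t n, float k) {
    const __m128 kv = _mm_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(out_stream + i, _mm_mul_ps(_mm_load_ps(in_a + i), kv));
        _mm_store_ps(out_cached + i, _mm_mul_ps(_mm_load_ps(in_b + i), kv));
    }
    _mm_sfence();   // order the streamed stores before later stores
}
```

Which output deserves the streaming store is a judgment call: it is usually the one that will not be read again soon. Profiling WcbFull PTI before and after, as listed above, is the way to confirm the change helped.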
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). Elapsed Time (s) by binary. Before: 104, After: 47. Chart label: +123]
CODE SAMPLE
#include <intrin.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // layout per 8 floats: x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char* argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
  movups   xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+42E40h]
  movntps  xmmword ptr [rsi+r8*4+42E40h],xmm0
  movups   xmm0,xmmword ptr [rsi+rcx*4+4640h]
  mulps    xmm0,xmm1
  addps    xmm0,xmmword ptr [rsi+r8*4+4640h]
  movntps  xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to processor work required to maintain cache-coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread local rather than global or process shared data.
• Align and pad Thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
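The "align and pad" advice can be sketched in portable C++ with std::thread, as a small counterpart to the Windows sample in this deck. It assumes 64-byte cache lines. The names kCacheLine, PaddedCounter, and run_counters are hypothetical.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, matching the 64B lines mentioned in this deck.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot, aligned and padded to one cache line.
struct alignas(kCacheLine) PaddedCounter {
    unsigned long sum = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "each slot occupies 64 bytes");

// Each thread adds only to its own slot. The sum fields are 64 bytes apart,
// so no two of them land in the same cache line.
std::vector<unsigned long> run_counters(unsigned numThreads, unsigned long iters) {
    std::vector<PaddedCounter> slots(numThreads);
    std::vector<std::thread> threads;
    for (unsigned t = 0; t < numThreads; ++t)
        threads.emplace_back([&slots, t, iters] {
            for (unsigned long i = 0; i < iters; ++i)
                slots[t].sum += i & 1;   // adds 1 on every odd i
        });
    for (auto& th : threads) th.join();
    std::vector<unsigned long> out;
    for (const auto& s : slots) out.push_back(s.sum);
    return out;
}
```

Removing alignas(kCacheLine) packs the counters next to each other, which is the layout the BEFORE (BAD) case illustrates. Comparing Data Cache Refills CCX PTI between the two variants is one way to see the effect.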
3RD GENERATION AMD RYZENtrade DESKTOP PROCESSOR
QUESTIONS
82
GIVEAWAY
THANK YOU
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
KENNETH MITCHELLISV GAME ENGINEERING RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
85
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201986
The information presented in this document is for informational purposes only and may contain technical inaccuracies omissions and typographical errors
The information contained herein is subject to change and may be rendered inaccurate for many reasons including but not limited to product and roadmap changes component and motherboard version changes new model andor product releases product differences between differing manufacturers software changes BIOS flashes firmware upgrades or the like AMD assumes no obligation to update or otherwise correct or revise this information However AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT INDIRECT SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES
ATTRIBUTIONcopy 2019 Advanced Micro Devices Inc All rights reserved AMD Ryzentrade and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices Inc in the United States andor other jurisdictions Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc Microsoft and Windows are registered trademarks of Microsoft Corporation PCIe is a registered trademark of PCI-SIG Other names are for informational purposes only and may be trademarks of their respective owners
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzen™ Processor Software Optimization | March 20th 2019
INTERLOCKED INTRINSIC FUNCTIONS

intrinsic                             generates instructions including
_InterlockedAnd                       lock cmpxchg
_interlockedbittestandreset           lock btr
_interlockedbittestandset             lock bts
_InterlockedCompareExchange           lock cmpxchg
_InterlockedCompareExchange128        lock cmpxchg16b
_InterlockedCompareExchangePointer    lock cmpxchg
_InterlockedDecrement                 lock dec
_InterlockedExchangeAdd               lock add
_InterlockedIncrement                 lock inc
_InterlockedOr                        lock cmpxchg
_InterlockedXor                       lock cmpxchg
_InterlockedExchange                  xchg
_InterlockedExchangePointer           xchg

Preferred: instructions without lock prefix.
Instructions generated may vary depending on compiler and optimization flags.
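For comparison outside MSVC, here is a minimal sketch of similar operations written with standard C++ `std::atomic` instead of the intrinsics in the table. The names `counter`, `flags`, `worker`, and `run` are illustrative and not from the slides, and the instructions a given compiler emits for each call are not guaranteed to match the table.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Illustrative shared state; names are not from the original slides.
std::atomic<std::uint32_t> counter{0};
std::atomic<std::uint32_t> flags{0};

// Each worker bumps the shared counter and then sets one flag bit.
void worker(unsigned bit, int iterations) {
    for (int i = 0; i < iterations; ++i)
        counter.fetch_add(1, std::memory_order_relaxed);   // similar role to _InterlockedExchangeAdd
    flags.fetch_or(1u << bit, std::memory_order_relaxed);  // similar role to _InterlockedOr
}

// Runs `threads` workers and returns the final counter value.
std::uint32_t run(unsigned threads, int iterations) {
    counter.store(0);
    flags.store(0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(worker, t, iterations);
    for (auto &th : pool)
        th.join();
    return counter.load();
}
```

`flags.exchange(0)` plays a similar role to `_InterlockedExchange`: it returns the previous value and stores the new one in a single atomic step.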
AVOID MEMCPY & MEMSET REGRESSION

PREFACE
• Currently, "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs under very specific circumstances: length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
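To make the size window concrete, here is a minimal sketch of the length condition from the preface as a predicate. The helper name `in_regression_window` is hypothetical. It covers only the length test; whether a length is known at compile time is a property of the call site and cannot be checked this way.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: true when a runtime-length copy falls in the window
// the preface describes (length > 32 && length <= 128 bytes).
constexpr bool in_regression_window(std::size_t length) {
    return length > 32 && length <= 128;
}
```

The benchmark in the code sample defaults to a copy size of 48 bytes, which lies inside this window.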
WORKAROUND

Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow (highlighting not preserved in this text)
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64\
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…

Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456:
XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

[Chart: call memcpy benchmark (less is better). X axis: size, 0 to 160. Y axis: Elapsed Time (s), 0 to 35. Series: before, after.]
CODE SAMPLE

#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <numeric>
alignas(64) char a[32 * 1024];
alignas(64) char b[32 * 1024];
using namespace std::chrono;

// Copy `size` bytes from a to b `steps` times and report the elapsed time.
void work(int size, size_t steps) {
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
}

int main(int argc, char *argv[]) {
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 7;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
    int size = (argc > 4) ? atoi(argv[4]) : 48;   // length is only known at run time
    srand(seed);
    // Fill the source buffer with pseudo-random bytes.
    for (size_t i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
    memset(b, 0, sizeof(b));
    work(size, steps);
    // Print one element of each buffer so the copy is not optimized away.
    printf("a[%i] = %i\n", j, a[j]);
    printf("b[%i] = %i\n", j, b[j]);
    return EXIT_SUCCESS;
}
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
; calls memcpy within executable with workaround
memcpy:
        mov     r11,rcx
        mov     r10,rdx
        cmp     r8,10h
        jbe     00000001400012F0
        cmp     r8,20h
        jbe     00000001400012D0
        sub     rdx,rcx
        jae     00000001400012A4
        lea     rax,[r8+r10]
        cmp     rcx,rax
        jb      00000001400015D0
        cmp     r8,80h
        jbe     0000000140001510
        …

BEFORE (WITHOUT WORKAROUND)
; calls memcpy in vcruntime140.dll
memcpy:
        jmp     qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS

SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0, Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
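One way to keep a single Write Combining stream open per hardware thread is to process the arrays one after another rather than interleaving stores to both. The sketch below assumes that approach. It is not the workaround used in this deck's code sample, which instead replaces one of the two streaming stores with a regular store. The function names and the 4-position, 4-velocity record layout are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <immintrin.h>

// Assumed layout: 4 position floats followed by 4 velocity floats per
// 8-float record. `data` must be 16-byte aligned and `len` a multiple of 8.
void step_one_stream(float *data, std::size_t len, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < len; i += 8) {
        __m128 p = _mm_load_ps(&data[i]);       // positions
        __m128 v = _mm_load_ps(&data[i + 4]);   // velocities
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));  // p += v * dt
        _mm_stream_ps(&data[i], p);             // the only non-temporal stream in this loop
    }
    _mm_sfence();  // order the streamed stores before any later stores
}

// Update two arrays back to back so only one write-combining stream is
// active at a time, instead of alternating stores between a and b.
void step_both(float *a, float *b, std::size_t len, float dt) {
    step_one_stream(a, len, dt);
    step_one_stream(b, len, dt);
}
```

Whether this beats a regular store for a given data set should be checked with the WcbFull PTI counter rather than assumed.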
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before   After
Time (s)    104      47       (+123%)

[Chart: Avoid Too Many Non-Temporal Streams (less is better). X axis: binary. Y axis: Elapsed Time (s), 0 to 120.]
CODE SAMPLE

#include <intrin.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#define LEN 64000
alignas(64) float a[LEN];
alignas(64) float b[LEN];

// Advance positions by velocity * dt for both arrays.
void step(float dt) {
    for (size_t i = 0; i < LEN; i += 8) {
        // x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);    // first non-temporal stream
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
#if 0 // without workaround: second non-temporal stream
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
#else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
#endif
    }
}

int main(int argc, char *argv[]) {
    using namespace std::chrono;
    int j = (argc > 1) ? atoi(argv[1]) : 0;
    int seed = (argc > 2) ? atoi(argv[2]) : 3;
    size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
    float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
    srand(seed);
    for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
    }
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < steps; i++)
        step(dt);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    printf("a[%i] = %f\n", j, a[j]);
    printf("b[%i] = %f\n", j, b[j]);
    return EXIT_SUCCESS;
}
PROFILING BEFORE
~23 WcbFull per 1000 instructions

PROFILING AFTER
~3 WcbFull per 1000 instructions
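The PTI figures quoted on the profiling slides are event counts normalized per thousand retired instructions. A minimal sketch of that normalization follows, assuming raw event and instruction totals are available. The function name is hypothetical and is not part of AMD uProf.

```cpp
#include <cassert>
#include <cstdint>

// Events per thousand retired instructions (PTI).
// Returns 0 when no instructions were retired, to avoid dividing by zero.
double per_thousand_instructions(std::uint64_t events,
                                 std::uint64_t retired_instructions) {
    if (retired_instructions == 0)
        return 0.0;
    return 1000.0 * static_cast<double>(events) /
           static_cast<double>(retired_instructions);
}
```

Normalizing this way lets functions with very different instruction counts be compared against the same thresholds.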
SOURCE BEFORE
[Screenshot not reproduced]

SOURCE AFTER
[Screenshot not reproduced]

DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
        movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
        movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
        movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
        movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
        movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
        mulps   xmm0,xmm1
        addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
        movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING

SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™
• Use thread local rather than global or process shared data
• Align and pad thread parameters, especially synchronization variables
• Profiling
  • AMD uProf v2.0, Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
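The two recommendations above can be sketched together with portable `std::thread`: each thread accumulates into a local variable and writes its result once into a slot that is aligned and padded to its own cache line. The 64-byte line size follows the slides. The names `PaddedCounter`, `g_counters`, `accumulate`, and `run_padded`, and the cap of 64 threads, are assumptions made for this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed line size, as in the slides
constexpr unsigned kMaxThreads = 64;    // arbitrary cap for this sketch

// One counter per cache line, so two threads never write the same line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t sum;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

PaddedCounter g_counters[kMaxThreads];

// Count the odd values below `iterations` using a thread-local accumulator.
void accumulate(unsigned slot, std::uint64_t iterations) {
    std::uint64_t local = 0;            // private working copy, not shared
    for (std::uint64_t i = 0; i < iterations; ++i)
        local += i & 1u;
    g_counters[slot].sum = local;       // single write to the shared array
}

// Runs up to kMaxThreads workers and returns the sum of their results.
std::uint64_t run_padded(unsigned threads, std::uint64_t iterations) {
    if (threads > kMaxThreads)
        threads = kMaxThreads;
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t)
        pool.emplace_back(accumulate, t, iterations);
    for (auto &th : pool)
        th.join();
    std::uint64_t total = 0;
    for (unsigned t = 0; t < threads; ++t)
        total += g_counters[t].sum;
    return total;
}
```

A static array is used so the 64-byte alignment does not depend on over-aligned heap allocation, which `new` only guarantees from C++17 onward.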
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

            Before   After
Time (s)    365      47       (+679%)

[Chart: Avoid False Sharing. X axis: binary. Y axis: Time (s), 0 to 400.]
CODE SAMPLE

#include <windows.h>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes: adjacent array elements share a cache line
struct ThreadData { unsigned long sum; };
#else // 64 bytes: each array element occupies its own cache line
struct alignas(64) ThreadData { unsigned long sum; };
#endif

// Each thread repeatedly updates its own ThreadData element.
DWORD WINAPI ThreadProcCallback(void *param) {
    ThreadData *p = (ThreadData *)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
    return 0;
}

int main(int argc, char *argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE *threads = new HANDLE[numThreads];
    ThreadData *a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i)
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void *)&a[i], 0, NULL);
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i)
        printf("sum[%zu] = %lu\n", i, a[i].sum);
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
[Profiler screenshot not reproduced. Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]

PROFILING AFTER
Few refills from CCX
[Profiler screenshot not reproduced. Same columns as above.]

SOURCE BEFORE
[Screenshot not reproduced]

SOURCE AFTER
[Screenshot not reproduced]
DISASSEMBLY SNIPPET

AFTER (GOOD)
0000000140001180:
        mov     qword ptr [rsp+28h],rsi
        lea     r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov     r9,rbp
        mov     dword ptr [rsp+20h],esi
        xor     edx,edx
        xor     ecx,ecx
        call    qword ptr [__imp_CreateThread]
        mov     qword ptr [r15+rdi*8],rax
        add     rbp,40h                 ; 64-byte stride between ThreadData elements
        inc     rdi
        cmp     rdi,r14
        jb      0000000140001180

BEFORE (BAD)
0000000140001180:
        mov     qword ptr [rsp+28h],rsi
        lea     r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov     r9,rbp
        mov     dword ptr [rsp+20h],esi
        xor     edx,edx
        xor     ecx,ecx
        call    qword ptr [__imp_CreateThread]
        mov     qword ptr [r12+rdi*8],rax
        add     rbp,4                   ; 4-byte stride between ThreadData elements
        inc     rdi
        cmp     rdi,r14
        jb      0000000140001180
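The two listings differ in the pointer increment passed to each new thread: `add rbp,4` versus `add rbp,40h`. The short sketch below works out which 64-byte cache line each thread's element lands in for a given stride. It assumes the array starts on a cache-line boundary, and the helper name `cache_line_of` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Index of the 64-byte cache line that holds element `i` of an array whose
// elements are `stride` bytes apart, assuming the array starts on a line
// boundary.
constexpr std::size_t cache_line_of(std::size_t i, std::size_t stride) {
    return (i * stride) / 64;
}
```

With a 4-byte stride, elements 0 through 15 all map to line 0, so 16 threads write to the same line. With a 64-byte stride, element `i` maps to line `i`.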
ROADMAP

CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change

3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New "Zen 2" core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS

GIVEAWAY

THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™ and the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT INDIRECT SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES
ATTRIBUTIONcopy 2019 Advanced Micro Devices Inc All rights reserved AMD Ryzentrade and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices Inc in the United States andor other jurisdictions Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc Microsoft and Windows are registered trademarks of Microsoft Corporation PCIe is a registered trademark of PCI-SIG Other names are for informational purposes only and may be trademarks of their respective owners
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PREFACE
• Currently, "C:\Windows\System32\vcruntime140.dll" includes a memcpy & memset regression that can affect AMD family 17h and later processors only.
• Code generated by MSVS 2015, 2017, or 2019 may link to vcruntime140.dll.
• However, we have found that this regression occurs only under very specific circumstances: length > 32 && length <= 128 bytes && length is unknown at compile time.
• Meanwhile, we propose a workaround.
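The size window stated above can be written down as a small predicate, which may help when auditing call sites that copy small, runtime-sized buffers. This is only a sketch: the helper name and constants are hypothetical and are not part of any AMD or Microsoft API; only the bounds (greater than 32, at most 128 bytes) come from the slide.

```cpp
#include <cstddef>

// Bounds as stated on the slide: length > 32 && length <= 128 bytes.
constexpr std::size_t kRegressionMinExclusive = 32;
constexpr std::size_t kRegressionMaxInclusive = 128;

// Hypothetical helper: reports whether a runtime copy length falls inside
// the size window in which the slide says the regression occurs.
constexpr bool in_regression_window(std::size_t length) {
    return length > kRegressionMinExclusive &&
           length <= kRegressionMaxInclusive;
}

// Boundary behaviour, checked at compile time.
static_assert(!in_regression_window(32), "32 bytes is below the window");
static_assert(in_regression_window(33), "33 bytes is inside the window");
static_assert(in_regression_window(128), "128 bytes is the inclusive upper bound");
static_assert(!in_regression_window(129), "129 bytes is above the window");
```

The predicate covers only the size condition; the remaining condition (length unknown at compile time) is a property of the call site and cannot be tested at run time.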
WORKAROUND
Workaround
• Modify Assembly Files (Copying Recommended)
  • Comment out lines shown in yellow
  • C:\Program Files (x86)\Microsoft Visual Studio\2017\Community\VC\Tools\MSVC\14.16.27023\crt\src\x64
    • memcpy.asm
    • memset.asm
• Assemble Object Files
  • ml64.exe -c memcpy.asm memset.asm
• Set Linker > Input > Additional Dependencies = memcpy.obj;memset.obj;…
Alternate Workaround
• Copy vcruntime140.dll from "Microsoft Visual Studio\2019\Preview\Common7\IDE\VC\vcpackages" to the application folder

memcpy.asm line 456:
    XmmCopySmall:
        bt      __favor, __FAVOR_SMSTRG   ; check if string copy should be used
        jc      memcpy_repmovs
        movups  xmm0, [rcx + rdx]         ; load deferred bytes
        add     rcx, 16
        sub     r8, 16

memset.asm line 93:
        ; Check if strings should be used
        cmp     r8, 128                   ; is this a small set, size <= 128?
        ja      XmmSet                    ; if large set, use XMM set
        bt      __favor, __FAVOR_SMSTRG   ; check if string set should be used
        jnc     XmmSetSmall               ; otherwise, use a 16-byte block set
        jmp     memset_repmovs

memcpy.asm & memset.asm code is Copyright (c) Microsoft Corporation. All rights reserved.
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: "call memcpy benchmark" (less is better). Y axis: Elapsed Time (s), 0 to 35. X axis: size, 0 to 160 in steps of 16. Series: before, after.]
CODE SAMPLE
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, srand, rand, EXIT_SUCCESS
    #include <cstring>  // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Times 'steps' calls to memcpy with a length that is only known at run time.
    void work(int size, size_t steps) {
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            memcpy(b, a, size);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char *argv[]) {
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 7;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
        int size = (argc > 4) ? atoi(argv[4]) : 48;
        srand(seed);
        // i < sizeof(a): the slide's "i <= sizeof(a)" would write one byte past the end of a.
        for (size_t i = 0; i < sizeof(a); ++i)
            a[i] = rand() % 256;
        memset(b, 0, sizeof(b));
        work(size, steps);
        // Print one element of each buffer so the copies cannot be optimized away.
        printf("a[%i] = %i\n", j, a[j]);
        printf("b[%i] = %i\n", j, b[j]);
        return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    ; calls memcpy within executable with workaround
    memcpy:
        mov   r11,rcx
        mov   r10,rdx
        cmp   r8,10h
        jbe   00000001400012F0
        cmp   r8,20h
        jbe   00000001400012D0
        sub   rdx,rcx
        jae   00000001400012A4
        lea   rax,[r8+r10]
        cmp   rcx,rax
        jb    00000001400015D0
        cmp   r8,80h
        jbe   0000000140001510
        …
BEFORE (WITHOUT WORKAROUND)
    ; calls memcpy in vcruntime140.dll
    memcpy:
        jmp   qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
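The "PTI" metric above is an event count normalized per thousand retired instructions. As a rough sketch of that arithmetic, the helpers below compute a PTI value from two raw counter readings and compare it against caller-supplied thresholds. The function names, the `Severity` enum, and the idea of passing thresholds as parameters are illustrative assumptions, not part of AMD uProf.

```cpp
// Events per thousand retired instructions (PTI).
// Returns 0 when no instructions were retired, to avoid dividing by zero.
constexpr double per_thousand_instructions(unsigned long long events,
                                           unsigned long long retired_instructions) {
    return retired_instructions == 0
               ? 0.0
               : 1000.0 * static_cast<double>(events) /
                     static_cast<double>(retired_instructions);
}

// Hypothetical classification used only for this sketch.
enum class Severity { Ok, Bad, VeryBad };

// The thresholds are parameters so that the caller can plug in the
// "bad" and "very bad" limits quoted in the profiling guidance.
constexpr Severity classify_pti(double pti,
                                double bad_threshold,
                                double very_bad_threshold) {
    return pti >= very_bad_threshold ? Severity::VeryBad
         : pti >= bad_threshold      ? Severity::Bad
                                     : Severity::Ok;
}
```

For example, 23 WcbFull events over 1000 retired instructions gives `per_thousand_instructions(23, 1000) == 23.0`, which `classify_pti` then ranks against whichever limits are passed in.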
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: "Avoid Too Many Non-Temporal Streams" (less is better). Elapsed Time (s) by binary: Before 104, After 47 (+123%).]
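The chart's percentage label can be related to the two elapsed times with a one-line calculation. The sketch below assumes the label is defined as (before / after - 1) x 100; that definition, and the function name, are assumptions rather than something the slide states.

```cpp
// Percentage by which the "before" time exceeds the "after" time.
// Assumed definition: (before / after - 1) * 100.
constexpr double speedup_percent(double before_seconds, double after_seconds) {
    return (before_seconds / after_seconds - 1.0) * 100.0;
}

// With the rounded bar values from the chart (104 s and 47 s) this gives
// a little over 121 percent.
static_assert(speedup_percent(104.0, 47.0) > 121.0 &&
              speedup_percent(104.0, 47.0) < 122.0,
              "rounded chart values give roughly 121 percent");
```

The result is close to, but not exactly, the chart's label of about +123 percent; a plausible explanation is that the label was computed from unrounded timings.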
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, atof, srand, rand, RAND_MAX, EXIT_SUCCESS
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // Layout per 8 floats: x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);   // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream interleaved with the first
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store, leaving one non-temporal stream
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char *argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
[Screenshot: SOURCE BEFORE]
[Screenshot: SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
    • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread-local rather than global or process-shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
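The "align and pad" advice can be checked mechanically. The sketch below contrasts an unpadded per-thread struct with a 64-byte-aligned one and tests whether neighbouring array elements start in the same cache line. The struct names, `kCacheLine`, and `same_cache_line` are hypothetical names chosen for this illustration; the 64-byte line size is an assumption taken from the 64B figure used elsewhere in these slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// Unpadded: neighbouring array elements share a cache line.
struct PackedSum { unsigned long sum; };

// Padded: alignas rounds sizeof up to 64, so each element owns a whole line.
struct alignas(kCacheLine) PaddedSum { unsigned long sum; };

static_assert(sizeof(PaddedSum) == kCacheLine, "one element per cache line");
static_assert(alignof(PaddedSum) == kCacheLine, "elements start on a line boundary");

// Returns true when the two addresses fall into the same 64-byte line.
inline bool same_cache_line(const void* a, const void* b) {
    const auto line_a = reinterpret_cast<std::uintptr_t>(a) / kCacheLine;
    const auto line_b = reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
    return line_a == line_b;
}

// Demonstrates the difference on two small stack arrays.
inline void demo_layout() {
    PaddedSum padded[2] = {};
    alignas(kCacheLine) PackedSum packed[2] = {};
    assert(!same_cache_line(&padded[0], &padded[1]));  // separate lines
    assert(same_cache_line(&packed[0], &packed[1]));   // shared line
}
```

Handing `&padded[i]` to thread `i` keeps each thread's writes in its own line. Arrays here live on the stack on purpose: heap allocation of over-aligned types with plain `new[]` is only guaranteed to honor the alignment from C++17 onward.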
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: "Avoid False Sharing". Time (s) by binary, axis 0 to 400: Before 365, After 47 (+679%).]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, srand, rand, EXIT_SUCCESS
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes: neighbouring elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes: one element per cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    // Each thread repeatedly updates its own ThreadData element.
    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char *argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback, (void*)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Screenshot: profiler table. Columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI.]
PROFILING AFTER
Few refills from CCX
[Screenshot: profiler table with the same columns.]
[Screenshot: SOURCE BEFORE]
[Screenshot: SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (GOOD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r15+rdi*8],rax
        add   rbp,40h
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
BEFORE (BAD)
    0000000140001180:
        mov   qword ptr [rsp+28h],rsi
        lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
        mov   r9,rbp
        mov   dword ptr [rsp+20h],esi
        xor   edx,edx
        xor   ecx,ecx
        call  qword ptr [__imp_CreateThread]
        mov   qword ptr [r12+rdi*8],rax
        add   rbp,4
        inc   rdi
        cmp   rdi,r14
        jb    0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201957
Workaroundbull Modify Assembly Files (Copying Recommended)
bull Comment out lines shown in yellowbull CProgram Files (x86)Microsoft Visual
Studio2017CommunityVCToolsMSVC141627023crtsrcx64
bull memcpyasmbull memsetasm
bull Assemble Object Filesbull ml64exe -c memcpyasm memsetasm
bull Set Linker gt Input gt Additional Dependencies = memcpyobjmemsetobjhellip
Alternate Workaround
bull Copy vcruntime140dll from ldquoMicrosoft Visual Studio2019PreviewCommon7IDEVCvcpackagesrdquo to the application folder
WORKAROUND memcpyasm line 456XmmCopySmall bt __favor __FAVOR_SMSTRG check if string copy should be used jc memcpy_repmovsmovups xmm0 [rcx + rdx] load deferred bytesadd rcx 16sub r8 16
memsetasm line 93 Check if strings should be used cmp r8 128 is this a small set size lt= 128 ja XmmSet if large set use XMM set bt __favor __FAVOR_SMSTRG check if string set should be used jnc XmmSetSmall otherwise use a 16-byte block set jmp memset_repmovs memcpyasm amp memsetasm code is Copyright (c) Microsoft
Corporation All rights reserved
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying workaround shows
higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell February 11
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
58
0
5
10
15
20
25
30
35
0 16 32 48 64 80 96 112 128 144 160
Elap
sed
Tim
e (s
)
size
call memcpy benchmark(less is better)
before after
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltnumericgtinclude ltchronogtalignas(64) char a[32 1024]alignas(64) char b[32 1024]using namespace stdchrono
void work(int size size_t steps) high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
memcpy(b a size)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())
59
int main(int argc char argv[]) int j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 7size_t steps = (argc gt 3) atoll(argv[3]) 1000000000int size = (argc gt 4) atoi(argv[4]) 48srand(seed)for (int i = 0 i lt= sizeof(a) ++i )
a[i] = rand()256memset(b 0 sizeof(b))
work(size steps)
printf(a[i] = in j a[j])printf(b[i] = in j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
60
AFTER (WITH WORKAROUND) calls memcpy within executable with workaroundmemcpymov r11rcxmov r10rdxcmp r810hjbe 00000001400012F0cmp r820hjbe 00000001400012D0sub rdxrcxjae 00000001400012A4lea rax[r8+r10]cmp rcxraxjb 00000001400015D0cmp r880hjbe 0000000140001510hellip
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140dllmemcpy
memcpy
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Write Combining lines are not stored in the caches The Write Combining Buffer writes 64 byte lines to
memory when full The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
bull Avoid interleaving multiple Write Combining streams to different addresses use only one stream per hardware thread if possible While using multiple streams the hardware may close buffers before they are completely full thus leading to reduced performance
bull Profilingbull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profile
bull WcbFull PTI bull gt= 22 Per Thousand Instructions is bad for top functionsbull gt= 35 Per Thousand Instructions is very bad for top functions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying the workaround
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 31
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
63
Before AfterTime (s) 104 47
+123
0
20
40
60
80
100
120
Elap
sed
Tim
e (s
)
binary
Avoid Too Many Non-Temporal Streams(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltintrinhgtinclude ltnumericgtinclude ltchronogtdefine LEN 64000alignas(64) float a[LEN]alignas(64) float b[LEN]
void step(float dt) for (size_t i = 0 i lt LEN i += 8)
xyzwvxvyvyvw__m128 p1 = _mm_load_ps(ampa[(i + 0) LEN])__m128 v1 = _mm_load_ps(ampa[(i + 4) LEN])p1 = _mm_add_ps(p1 _mm_mul_ps(v1 _mm_load_ps1(ampdt)))_mm_stream_ps(ampa[(i + 0) LEN] p1)__m128 p2 = _mm_load_ps(ampb[(i + 0) LEN])__m128 v2 = _mm_load_ps(ampb[(i + 4) LEN])p2 = _mm_add_ps(p2 _mm_mul_ps(v2 _mm_load_ps1(ampdt)))
if 0 without workaround _mm_stream_ps(ampb[(i + 0) LEN] p2)
else with workaround _mm_store_ps(ampb[(i + 0) LEN] p2)
endif
64
int main(int argc char argv[]) using namespace stdchronoint j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 3size_t steps = (argc gt 3) atoll(argv[3]) 2000000float dt = (argc gt 4) (float)atof(argv[4]) 0001fsrand(seed)for (int i = 0 i lt LEN ++i)
a[i] = (float)rand() RAND_MAXb[i] = (float)rand() RAND_MAX
high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
step(dt)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())printf(a[i] = fn j a[j])printf(b[i] = fn j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~23 WcbFull per 1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movups xmmword ptr [rsi+r84+4640h]xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movntps xmmword ptr [rsi+r84+4640h]xmm0
AVOID FALSE SHARING
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull False Sharing occurs when threads running on different processors each with a local cache modify
variables that exist in the same cache line This can reduce performance due to processor work required to maintain cache-coherencybull L3 is filled from L2 victims of all 4 cores within a CCX
bull L2 tags are duplicated in the L3 for fast cache transfers within a CCXbull And fast probe filtering for RyzentradeThreadrippertrade and Epyctrade
bull Use thread local rather than global or process shared databull Align and pad Thread parameters ndash especially synchronizations variablesbull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull Data Cache Refills CCX PTI
bull Minimize
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1593bull Testing done by Kenneth Mitchell February 5
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
72
Before AfterTime (s) 365 47
+6790
50
100
150
200
250
300
350
400
Tim
e (s
)
binary
Avoid False Sharing
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltwindowshgtinclude ltchronogtinclude ltnumericgtinclude ltthreadgtusing namespace stdchronodefine NUM_ITER 2000000000int seed
if 0 4 bytes struct ThreadData unsigned long sum else 64 bytes struct alignas(64) ThreadData unsigned long sum endifDWORD WINAPI ThreadProcCallback(void param) ThreadData p = (ThreadData)paramsrand(seed)p-gtsum = 0for (int i = 0 i lt NUM_ITER i++)
p-gtsum += rand() 2return 0
73
int main(int argc char argv[]) seed = (argc gt 1) atoi(argv[1]) 3int numThreads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[numThreads]ThreadData a = new ThreadData[numThreads]high_resolution_clocktime_point t0 = high_resolution_clocknow()for (size_t i = 0 i lt numThreads ++i) threads[i] = CreateThread(NULL 0 ThreadProcCallback
(void)ampa[i] 0 NULL)WaitForMultipleObjects(numThreads threads TRUE INFINITE)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (ms) lfn 10000 time_spancount())for (size_t i = 0 i lt numThreads ++i)
printf(sum[llu] = lun i a[i]sum)delete[] adelete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFOREMany refills from
CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTERFew refills from CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
76
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
77
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
78
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi8]rax
add rbp40h
inc rdi
cmp rdir14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi8]rax
add rbp4
inc rdi
cmp rdir14
jb 0000000140001180
ROADMAP
79
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201980
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
PERFORMANCE
• Binary compiled after applying workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, February 11, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: call memcpy benchmark (less is better). X axis: size, 0 to 160 in steps of 16. Y axis: Elapsed Time (s), 0 to 35. Series: before, after.]
CODE SAMPLE

    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, rand, srand, EXIT_SUCCESS
    #include <cstring>  // memcpy, memset
    alignas(64) char a[32 * 1024];
    alignas(64) char b[32 * 1024];
    using namespace std::chrono;

    // Copy `size` bytes from a to b `steps` times and report the elapsed time.
    void work(int size, size_t steps) {
      high_resolution_clock::time_point t0 = high_resolution_clock::now();
      for (size_t i = 0; i < steps; i++)
        memcpy(b, a, size);
      high_resolution_clock::time_point t1 = high_resolution_clock::now();
      duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
      printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
    }

    int main(int argc, char *argv[]) {
      int j = (argc > 1) ? atoi(argv[1]) : 0;
      int seed = (argc > 2) ? atoi(argv[2]) : 7;
      size_t steps = (argc > 3) ? atoll(argv[3]) : 1000000000;
      int size = (argc > 4) ? atoi(argv[4]) : 48;
      srand(seed);
      // Note: the slide used "i <= sizeof(a)", which writes one byte past the array.
      for (size_t i = 0; i < sizeof(a); ++i)
        a[i] = rand() % 256;
      memset(b, 0, sizeof(b));
      work(size, steps);
      // Print one element of each array so the copy cannot be optimized away.
      printf("a[%i] = %i\n", j, a[j]);
      printf("b[%i] = %i\n", j, b[j]);
      return EXIT_SUCCESS;
    }
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)

    ; calls memcpy within executable with workaround
    memcpy:
    mov  r11,rcx
    mov  r10,rdx
    cmp  r8,10h
    jbe  00000001400012F0
    cmp  r8,20h
    jbe  00000001400012D0
    sub  rdx,rcx
    jae  00000001400012A4
    lea  rax,[r8+r10]
    cmp  rcx,rax
    jb   00000001400015D0
    cmp  r8,80h
    jbe  0000000140001510
    …

BEFORE (WITHOUT WORKAROUND)

    ; calls vcruntime140.dll memcpy
    memcpy:
    jmp  qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams (less is better). X axis: binary. Y axis: Elapsed Time (s), 0 to 120. Time (s): Before 104, After 47 (+123%).]
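The +123 label on this chart can be related to the two elapsed times with a little arithmetic. The sketch below assumes the label expresses how much faster the fixed binary runs, that is (before / after - 1) × 100. The function name `speedup_percent` is made up for illustration and does not come from the talk. With the rounded times shown (104 s and 47 s) it gives roughly 121%, close to the value on the slide, which was presumably computed from unrounded timings.

```cpp
// Percent improvement when elapsed time drops from before_s to after_s.
// Assumption: "improvement" is defined as (before / after - 1) * 100.
double speedup_percent(double before_s, double after_s) {
    return (before_s / after_s - 1.0) * 100.0;
}
```

Calling `speedup_percent(104.0, 47.0)` with the chart's rounded values yields a result between 121 and 122.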
CODE SAMPLE

    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atof, atoll, rand, srand, EXIT_SUCCESS
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    // Advance positions by velocity * dt for the particles stored in a and b.
    void step(float dt) {
      for (size_t i = 0; i < LEN; i += 8) {
        // layout per particle: x,y,z,w,vx,vy,vz,vw
        __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
        __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
        p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
        _mm_stream_ps(&a[(i + 0) % LEN], p1);  // first non-temporal stream
        __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
        __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
        p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream in the same loop
        _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
        _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
      }
    }

    int main(int argc, char *argv[]) {
      using namespace std::chrono;
      int j = (argc > 1) ? atoi(argv[1]) : 0;
      int seed = (argc > 2) ? atoi(argv[2]) : 3;
      size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
      float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
      srand(seed);
      for (int i = 0; i < LEN; ++i) {
        a[i] = (float)rand() / RAND_MAX;
        b[i] = (float)rand() / RAND_MAX;
      }
      high_resolution_clock::time_point t0 = high_resolution_clock::now();
      for (size_t i = 0; i < steps; i++)
        step(dt);
      high_resolution_clock::time_point t1 = high_resolution_clock::now();
      duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
      printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
      printf("a[%i] = %f\n", j, a[j]);
      printf("b[%i] = %f\n", j, b[j]);
      return EXIT_SUCCESS;
    }
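The sample's workaround turns the second stream into a regular store. A different way to follow the "one stream per hardware thread" advice, not shown in the talk, is to split the loop so that each pass writes a single non-temporal stream. The sketch below is an assumption-laden illustration: the names `step_one_array` and `step_split` are hypothetical, it targets x86/x64 with SSE, and whether it outperforms the sample's workaround would have to be measured.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics (x86/x64 only)

// Advance one particle array laid out as x,y,z,w,vx,vy,vz,vw.
// Assumptions: `data` is 16-byte aligned and `len` is a multiple of 8.
void step_one_array(float* data, std::size_t len, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    for (std::size_t i = 0; i < len; i += 8) {
        __m128 p = _mm_load_ps(&data[i]);      // position
        __m128 v = _mm_load_ps(&data[i + 4]);  // velocity
        p = _mm_add_ps(p, _mm_mul_ps(v, vdt));
        _mm_stream_ps(&data[i], p);  // the only non-temporal stream in this loop
    }
    _mm_sfence();  // order the streamed stores before later stores
}

// Hypothetical replacement for a fused loop: two passes, one stream each.
void step_split(float* a, float* b, std::size_t len, float dt) {
    step_one_array(a, len, dt);
    step_one_array(b, len, dt);
}
```

The trade-off is a second pass over the loop overhead in exchange for never interleaving two write-combining streams on the same hardware thread.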
PROFILING BEFORE
~23 WcbFull per 1000 instructions
PROFILING AFTER
~3 WcbFull per 1000 instructions
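The "per 1000 instructions" (PTI) figures above normalize a raw event count by the number of retired instructions, which makes functions of different sizes comparable. A minimal sketch of that arithmetic follows. The function name and the zero-instruction guard are illustrative choices, not part of uProf.

```cpp
#include <cstdint>

// Events per thousand retired instructions (PTI).
// Returns 0 when no instructions were retired, to avoid dividing by zero.
double per_thousand_instructions(std::uint64_t event_count,
                                 std::uint64_t retired_instructions) {
    if (retired_instructions == 0) return 0.0;
    return 1000.0 * static_cast<double>(event_count) /
           static_cast<double>(retired_instructions);
}
```

For example, 46,000 WcbFull events over 2,000,000 retired instructions works out to 23 PTI.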
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET

AFTER (WITH WORKAROUND)

    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h],xmm0

BEFORE (WITHOUT WORKAROUND)

    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h],xmm0
    movups  xmm0,xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0,xmm1
    addps   xmm0,xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h],xmm0
AVOID FALSE SHARING
SUMMARY
• False Sharing occurs when threads running on different processors, each with a local cache, modify variables that exist in the same cache line. This can reduce performance due to the processor work required to maintain cache coherency.
  • L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™.
• Use thread local rather than global or process shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • Data Cache Refills CCX PTI
      • Minimize
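The "align and pad" advice can be checked at compile time. The sketch below assumes a 64-byte cache line, matching the 64B lines mentioned earlier in this talk. The struct names and the `elements_per_line` helper are illustrative and do not come from the slides.

```cpp
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes

// Unpadded: several array elements share one cache line.
struct PackedData { unsigned long sum; };

// Padded and aligned: each array element occupies its own cache line.
struct alignas(kCacheLine) PaddedData { unsigned long sum; };

// Number of consecutive array elements of type T that fit in one cache line.
template <typename T>
constexpr std::size_t elements_per_line() {
    return sizeof(T) >= kCacheLine ? 1 : kCacheLine / sizeof(T);
}

static_assert(sizeof(PaddedData) == kCacheLine, "padded struct fills a line");
static_assert(alignof(PaddedData) == kCacheLine, "padded struct starts a line");
static_assert(elements_per_line<PaddedData>() == 1, "no sharing between elements");
```

With `PackedData`, 8 or 16 per-thread elements (depending on whether `unsigned long` is 8 or 4 bytes) land on one line, so threads updating neighbouring elements contend for it. With `PaddedData`, each thread's element has a line to itself.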
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing. X axis: binary. Y axis: Time (s), 0 to 400. Time (s): Before 365, After 47 (+679%).]
CODE SAMPLE

    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, rand, srand, EXIT_SUCCESS
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes: neighbouring elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes: one cache line per element
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    // Each thread accumulates into its own ThreadData element.
    DWORD WINAPI ThreadProcCallback(void* param) {
      ThreadData* p = (ThreadData*)param;
      srand(seed);
      p->sum = 0;
      for (int i = 0; i < NUM_ITER; i++)
        p->sum += rand() % 2;
      return 0;
    }

    int main(int argc, char *argv[]) {
      seed = (argc > 1) ? atoi(argv[1]) : 3;
      const size_t numThreads = std::thread::hardware_concurrency();
      HANDLE* threads = new HANDLE[numThreads];
      ThreadData* a = new ThreadData[numThreads];
      high_resolution_clock::time_point t0 = high_resolution_clock::now();
      for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                                  (void*)&a[i], 0, NULL);
      }
      WaitForMultipleObjects((DWORD)numThreads, threads, TRUE, INFINITE);
      high_resolution_clock::time_point t1 = high_resolution_clock::now();
      duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
      printf("time (ms): %lf\n", 1000.0 * time_span.count());
      for (size_t i = 0; i < numThreads; ++i) {
        printf("sum[%zu] = %lu\n", i, a[i].sum);
        CloseHandle(threads[i]);  // release the thread handle
      }
      delete[] a;
      delete[] threads;
      return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Profiler table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Profiler table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET

AFTER (GOOD)

    0000000140001180
    mov  qword ptr [rsp+28h],rsi
    lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9,rbp                    ; thread parameter = &a[i]
    mov  dword ptr [rsp+20h],esi
    xor  edx,edx
    xor  ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r15+rdi*8],rax
    add  rbp,40h                   ; next element is 64 bytes away (own cache line)
    inc  rdi
    cmp  rdi,r14
    jb   0000000140001180

BEFORE (BAD)

    0000000140001180
    mov  qword ptr [rsp+28h],rsi
    lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9,rbp                    ; thread parameter = &a[i]
    mov  dword ptr [rsp+20h],esi
    xor  edx,edx
    xor  ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r12+rdi*8],rax
    add  rbp,4                     ; next element is 4 bytes away (same cache line)
    inc  rdi
    cmp  rdi,r14
    jb   0000000140001180
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltnumericgtinclude ltchronogtalignas(64) char a[32 1024]alignas(64) char b[32 1024]using namespace stdchrono
void work(int size size_t steps) high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
memcpy(b a size)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())
59
int main(int argc char argv[]) int j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 7size_t steps = (argc gt 3) atoll(argv[3]) 1000000000int size = (argc gt 4) atoi(argv[4]) 48srand(seed)for (int i = 0 i lt= sizeof(a) ++i )
a[i] = rand()256memset(b 0 sizeof(b))
work(size steps)
printf(a[i] = in j a[j])printf(b[i] = in j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
60
AFTER (WITH WORKAROUND) calls memcpy within executable with workaroundmemcpymov r11rcxmov r10rdxcmp r810hjbe 00000001400012F0cmp r820hjbe 00000001400012D0sub rdxrcxjae 00000001400012A4lea rax[r8+r10]cmp rcxraxjb 00000001400015D0cmp r880hjbe 0000000140001510hellip
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140dllmemcpy
memcpy
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Write Combining lines are not stored in the caches The Write Combining Buffer writes 64 byte lines to
memory when full The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
bull Avoid interleaving multiple Write Combining streams to different addresses use only one stream per hardware thread if possible While using multiple streams the hardware may close buffers before they are completely full thus leading to reduced performance
bull Profilingbull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profile
bull WcbFull PTI bull gt= 22 Per Thousand Instructions is bad for top functionsbull gt= 35 Per Thousand Instructions is very bad for top functions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying the workaround
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 31
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
63
Before AfterTime (s) 104 47
+123
0
20
40
60
80
100
120
Elap
sed
Tim
e (s
)
binary
Avoid Too Many Non-Temporal Streams(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltintrinhgtinclude ltnumericgtinclude ltchronogtdefine LEN 64000alignas(64) float a[LEN]alignas(64) float b[LEN]
void step(float dt) for (size_t i = 0 i lt LEN i += 8)
xyzwvxvyvyvw__m128 p1 = _mm_load_ps(ampa[(i + 0) LEN])__m128 v1 = _mm_load_ps(ampa[(i + 4) LEN])p1 = _mm_add_ps(p1 _mm_mul_ps(v1 _mm_load_ps1(ampdt)))_mm_stream_ps(ampa[(i + 0) LEN] p1)__m128 p2 = _mm_load_ps(ampb[(i + 0) LEN])__m128 v2 = _mm_load_ps(ampb[(i + 4) LEN])p2 = _mm_add_ps(p2 _mm_mul_ps(v2 _mm_load_ps1(ampdt)))
if 0 without workaround _mm_stream_ps(ampb[(i + 0) LEN] p2)
else with workaround _mm_store_ps(ampb[(i + 0) LEN] p2)
endif
64
int main(int argc char argv[]) using namespace stdchronoint j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 3size_t steps = (argc gt 3) atoll(argv[3]) 2000000float dt = (argc gt 4) (float)atof(argv[4]) 0001fsrand(seed)for (int i = 0 i lt LEN ++i)
a[i] = (float)rand() RAND_MAXb[i] = (float)rand() RAND_MAX
high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
step(dt)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())printf(a[i] = fn j a[j])printf(b[i] = fn j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~23 WcbFull per 1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movups xmmword ptr [rsi+r84+4640h]xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movntps xmmword ptr [rsi+r84+4640h]xmm0
AVOID FALSE SHARING
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull False Sharing occurs when threads running on different processors each with a local cache modify
variables that exist in the same cache line This can reduce performance due to processor work required to maintain cache-coherencybull L3 is filled from L2 victims of all 4 cores within a CCX
bull L2 tags are duplicated in the L3 for fast cache transfers within a CCXbull And fast probe filtering for RyzentradeThreadrippertrade and Epyctrade
bull Use thread local rather than global or process shared databull Align and pad Thread parameters ndash especially synchronizations variablesbull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull Data Cache Refills CCX PTI
bull Minimize
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1593bull Testing done by Kenneth Mitchell February 5
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
72
Before AfterTime (s) 365 47
+6790
50
100
150
200
250
300
350
400
Tim
e (s
)
binary
Avoid False Sharing
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltwindowshgtinclude ltchronogtinclude ltnumericgtinclude ltthreadgtusing namespace stdchronodefine NUM_ITER 2000000000int seed
if 0 4 bytes struct ThreadData unsigned long sum else 64 bytes struct alignas(64) ThreadData unsigned long sum endifDWORD WINAPI ThreadProcCallback(void param) ThreadData p = (ThreadData)paramsrand(seed)p-gtsum = 0for (int i = 0 i lt NUM_ITER i++)
p-gtsum += rand() 2return 0
73
int main(int argc char argv[]) seed = (argc gt 1) atoi(argv[1]) 3int numThreads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[numThreads]ThreadData a = new ThreadData[numThreads]high_resolution_clocktime_point t0 = high_resolution_clocknow()for (size_t i = 0 i lt numThreads ++i) threads[i] = CreateThread(NULL 0 ThreadProcCallback
(void)ampa[i] 0 NULL)WaitForMultipleObjects(numThreads threads TRUE INFINITE)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (ms) lfn 10000 time_spancount())for (size_t i = 0 i lt numThreads ++i)
printf(sum[llu] = lun i a[i]sum)delete[] adelete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFOREMany refills from
CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTERFew refills from CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
76
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
77
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
78
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi8]rax
add rbp40h
inc rdi
cmp rdir14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi8]rax
add rbp4
inc rdi
cmp rdir14
jb 0000000140001180
ROADMAP
79
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201980
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
bull New Zen2 core microarchitecturebull New 7nm process technologybull Higher instructions-per-cycle (IPC)bull Doubled floating point width to 256-bitbull Doubled loadstore bandwidthbull PCIereg Gen4 Readybull AM4 Desktop Infrastructurebull See httpswwwamdcomeneventscesbull See httpswwwamdcomeneventsnext-horizon
3RD GENERATION AMD RYZENtrade DESKTOP PROCESSOR
QUESTIONS
82
GIVEAWAY
THANK YOU
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
KENNETH MITCHELLISV GAME ENGINEERING RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
85
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201986
The information presented in this document is for informational purposes only and may contain technical inaccuracies omissions and typographical errors
The information contained herein is subject to change and may be rendered inaccurate for many reasons including but not limited to product and roadmap changes component and motherboard version changes new model andor product releases product differences between differing manufacturers software changes BIOS flashes firmware upgrades or the like AMD assumes no obligation to update or otherwise correct or revise this information However AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT INDIRECT SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES
ATTRIBUTIONcopy 2019 Advanced Micro Devices Inc All rights reserved AMD Ryzentrade and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices Inc in the United States andor other jurisdictions Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc Microsoft and Windows are registered trademarks of Microsoft Corporation PCIe is a registered trademark of PCI-SIG Other names are for informational purposes only and may be trademarks of their respective owners
DISCLAIMER AND ATTRIBUTION
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
60
AFTER (WITH WORKAROUND) calls memcpy within executable with workaroundmemcpymov r11rcxmov r10rdxcmp r810hjbe 00000001400012F0cmp r820hjbe 00000001400012D0sub rdxrcxjae 00000001400012A4lea rax[r8+r10]cmp rcxraxjb 00000001400015D0cmp r880hjbe 0000000140001510hellip
BEFORE (WITHOUT WORKAROUND)
calls vcruntime140dllmemcpy
memcpy
jmp qword ptr [__imp_memcpy]
AVOID TOO MANY NON-TEMPORAL STREAMS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull Write Combining lines are not stored in the caches The Write Combining Buffer writes 64 byte lines to
memory when full The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread)
bull Avoid interleaving multiple Write Combining streams to different addresses use only one stream per hardware thread if possible While using multiple streams the hardware may close buffers before they are completely full thus leading to reduced performance
bull Profilingbull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profile
bull WcbFull PTI bull gt= 22 Per Thousand Instructions is bad for top functionsbull gt= 35 Per Thousand Instructions is very bad for top functions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying the workaround
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 31
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
63
Before AfterTime (s) 104 47
+123
0
20
40
60
80
100
120
Elap
sed
Tim
e (s
)
binary
Avoid Too Many Non-Temporal Streams(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltintrinhgtinclude ltnumericgtinclude ltchronogtdefine LEN 64000alignas(64) float a[LEN]alignas(64) float b[LEN]
void step(float dt) for (size_t i = 0 i lt LEN i += 8)
xyzwvxvyvyvw__m128 p1 = _mm_load_ps(ampa[(i + 0) LEN])__m128 v1 = _mm_load_ps(ampa[(i + 4) LEN])p1 = _mm_add_ps(p1 _mm_mul_ps(v1 _mm_load_ps1(ampdt)))_mm_stream_ps(ampa[(i + 0) LEN] p1)__m128 p2 = _mm_load_ps(ampb[(i + 0) LEN])__m128 v2 = _mm_load_ps(ampb[(i + 4) LEN])p2 = _mm_add_ps(p2 _mm_mul_ps(v2 _mm_load_ps1(ampdt)))
if 0 without workaround _mm_stream_ps(ampb[(i + 0) LEN] p2)
else with workaround _mm_store_ps(ampb[(i + 0) LEN] p2)
endif
64
int main(int argc char argv[]) using namespace stdchronoint j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 3size_t steps = (argc gt 3) atoll(argv[3]) 2000000float dt = (argc gt 4) (float)atof(argv[4]) 0001fsrand(seed)for (int i = 0 i lt LEN ++i)
a[i] = (float)rand() RAND_MAXb[i] = (float)rand() RAND_MAX
high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
step(dt)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())printf(a[i] = fn j a[j])printf(b[i] = fn j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~23 WcbFull per 1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movups xmmword ptr [rsi+r84+4640h]xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movntps xmmword ptr [rsi+r84+4640h]xmm0
AVOID FALSE SHARING
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull False Sharing occurs when threads running on different processors each with a local cache modify
variables that exist in the same cache line This can reduce performance due to processor work required to maintain cache-coherencybull L3 is filled from L2 victims of all 4 cores within a CCX
bull L2 tags are duplicated in the L3 for fast cache transfers within a CCXbull And fast probe filtering for RyzentradeThreadrippertrade and Epyctrade
bull Use thread local rather than global or process shared databull Align and pad Thread parameters ndash especially synchronizations variablesbull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull Data Cache Refills CCX PTI
bull Minimize
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid False Sharing, time in seconds by binary. Before: 365, After: 47 (+679%)]
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, rand, srand, EXIT_SUCCESS
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0 // 4 bytes: adjacent elements share a cache line
    struct ThreadData { unsigned long sum; };
    #else // 64 bytes: one element per cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    // Each thread repeatedly updates its own ThreadData element.
    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i) {
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                (void*)&a[i], 0, NULL);
        }
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX
[Table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
PROFILING AFTER
Few refills from CCX
[Table columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI]
[Figures: SOURCE BEFORE, SOURCE AFTER]
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
    mov  qword ptr [rsp+28h], rsi
    lea  r8, [?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9, rbp
    mov  dword ptr [rsp+20h], esi
    xor  edx, edx
    xor  ecx, ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r15+rdi*8], rax
    add  rbp, 40h
    inc  rdi
    cmp  rdi, r14
    jb   0000000140001180
BEFORE (BAD)
0000000140001180:
    mov  qword ptr [rsp+28h], rsi
    lea  r8, [?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9, rbp
    mov  dword ptr [rsp+20h], esi
    xor  edx, edx
    xor  ecx, ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r12+rdi*8], rax
    add  rbp, 4
    inc  rdi
    cmp  rdi, r14
    jb   0000000140001180
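Apart from the register that holds the handle array, the two listings differ only in how far rbp (the thread parameter pointer) advances per iteration: 4 bytes versus 40h (64) bytes. The sketch below shows what that stride implies, assuming a 64-byte cache line. PackedData, PaddedData and elements_per_line are hypothetical names, and std::uint32_t stands in for the 4-byte unsigned long of the Windows build.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

struct PackedData { std::uint32_t sum; };              // 4-byte stride
struct alignas(64) PaddedData { std::uint32_t sum; };  // 64-byte (40h) stride

// Upper bound on how many consecutive array elements of T fit in one line.
template <class T>
constexpr std::size_t elements_per_line() {
    return sizeof(T) >= kCacheLine ? 1 : kCacheLine / sizeof(T);
}

static_assert(elements_per_line<PackedData>() == 16,
              "up to 16 threads' sums can share one cache line");
static_assert(elements_per_line<PaddedData>() == 1,
              "each thread's sum occupies its own cache line");
```

With the 4-byte stride, writes from different threads can land in the same line, which is the false-sharing case the BEFORE listing represents.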
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New "Zen 2" core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
AVOID TOO MANY NON-TEMPORAL STREAMS
SUMMARY
• Write Combining lines are not stored in the caches. The Write Combining Buffer writes 64-byte lines to memory when full. The processor can gather writes from 8 different 64B cache lines (up to 7 from one hardware thread).
• Avoid interleaving multiple Write Combining streams to different addresses; use only one stream per hardware thread if possible. While using multiple streams, the hardware may close buffers before they are completely full, thus leading to reduced performance.
• Profiling
  • AMD uProf v2.0 Assess Performance (Extended) Event Based Sampling Profile
    • WcbFull PTI
      • >= 22 Per Thousand Instructions is bad for top functions
      • >= 35 Per Thousand Instructions is very bad for top functions
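One way to keep a single Write Combining stream active at a time is loop fission: finish all non-temporal stores to one array before starting on the next. The sketch below illustrates that idea only. It is a different approach from the workaround in the talk's sample (which replaces one streaming store with a regular store), and it has not been measured. The function name is hypothetical, an x86 target with SSE is assumed, and the position/velocity layout in groups of 8 floats is taken from the talk's sample.

```cpp
#include <cstddef>
#include <xmmintrin.h>  // SSE: _mm_load_ps, _mm_stream_ps, _mm_sfence

// Advances positions by velocity * dt in two arrays laid out as groups of
// 8 floats: 4 position components followed by 4 velocity components.
// Preconditions: a and b are 16-byte aligned; len is a multiple of 8.
void step_one_stream_at_a_time(float* a, float* b, std::size_t len, float dt) {
    const __m128 vdt = _mm_set1_ps(dt);
    // Pass 1: the only non-temporal stream in flight targets array a.
    for (std::size_t i = 0; i < len; i += 8) {
        const __m128 p = _mm_load_ps(a + i);
        const __m128 v = _mm_load_ps(a + i + 4);
        _mm_stream_ps(a + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    // Pass 2: the only non-temporal stream in flight targets array b.
    for (std::size_t i = 0; i < len; i += 8) {
        const __m128 p = _mm_load_ps(b + i);
        const __m128 v = _mm_load_ps(b + i + 4);
        _mm_stream_ps(b + i, _mm_add_ps(p, _mm_mul_ps(v, vdt)));
    }
    _mm_sfence();  // order the non-temporal stores before any later stores
}
```

Whether this beats the mixed stream/store version depends on the workload, so the WcbFull PTI counter is the thing to check after a change like this.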
PERFORMANCE
• Binary compiled after applying the workaround shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.6.
• Testing done by Kenneth Mitchell, January 31, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX 580, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
[Chart: Avoid Too Many Non-Temporal Streams, elapsed time in seconds by binary (less is better). Before: 104, After: 47 (+123%)]
CODE SAMPLE
    #include <intrin.h>
    #include <numeric>
    #include <chrono>
    #include <cstdio>   // printf
    #include <cstdlib>  // atoi, atoll, atof, rand, srand, EXIT_SUCCESS
    #define LEN 64000
    alignas(64) float a[LEN];
    alignas(64) float b[LEN];

    void step(float dt) {
        for (size_t i = 0; i < LEN; i += 8) {
            // x,y,z,w,vx,vy,vz,vw
            __m128 p1 = _mm_load_ps(&a[(i + 0) % LEN]);
            __m128 v1 = _mm_load_ps(&a[(i + 4) % LEN]);
            p1 = _mm_add_ps(p1, _mm_mul_ps(v1, _mm_load_ps1(&dt)));
            _mm_stream_ps(&a[(i + 0) % LEN], p1);  // first non-temporal stream
            __m128 p2 = _mm_load_ps(&b[(i + 0) % LEN]);
            __m128 v2 = _mm_load_ps(&b[(i + 4) % LEN]);
            p2 = _mm_add_ps(p2, _mm_mul_ps(v2, _mm_load_ps1(&dt)));
    #if 0 // without workaround: second non-temporal stream
            _mm_stream_ps(&b[(i + 0) % LEN], p2);
    #else // with workaround: regular store
            _mm_store_ps(&b[(i + 0) % LEN], p2);
    #endif
        }
    }

    int main(int argc, char* argv[]) {
        using namespace std::chrono;
        int j = (argc > 1) ? atoi(argv[1]) : 0;
        int seed = (argc > 2) ? atoi(argv[2]) : 3;
        size_t steps = (argc > 3) ? atoll(argv[3]) : 2000000;
        float dt = (argc > 4) ? (float)atof(argv[4]) : 0.001f;
        srand(seed);
        for (int i = 0; i < LEN; ++i) {
            a[i] = (float)rand() / RAND_MAX;
            b[i] = (float)rand() / RAND_MAX;
        }
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < steps; i++)
            step(dt);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (milliseconds): %lf\n", 1000.0 * time_span.count());
        printf("a[%i] = %f\n", j, a[j]);
        printf("b[%i] = %f\n", j, b[j]);
        return EXIT_SUCCESS;
    }
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying the workaround
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1596bull Testing done by Kenneth Mitchell January 31
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX 580 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
63
Before AfterTime (s) 104 47
+123
0
20
40
60
80
100
120
Elap
sed
Tim
e (s
)
binary
Avoid Too Many Non-Temporal Streams(less is better)
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltintrinhgtinclude ltnumericgtinclude ltchronogtdefine LEN 64000alignas(64) float a[LEN]alignas(64) float b[LEN]
void step(float dt) for (size_t i = 0 i lt LEN i += 8)
xyzwvxvyvyvw__m128 p1 = _mm_load_ps(ampa[(i + 0) LEN])__m128 v1 = _mm_load_ps(ampa[(i + 4) LEN])p1 = _mm_add_ps(p1 _mm_mul_ps(v1 _mm_load_ps1(ampdt)))_mm_stream_ps(ampa[(i + 0) LEN] p1)__m128 p2 = _mm_load_ps(ampb[(i + 0) LEN])__m128 v2 = _mm_load_ps(ampb[(i + 4) LEN])p2 = _mm_add_ps(p2 _mm_mul_ps(v2 _mm_load_ps1(ampdt)))
if 0 without workaround _mm_stream_ps(ampb[(i + 0) LEN] p2)
else with workaround _mm_store_ps(ampb[(i + 0) LEN] p2)
endif
64
int main(int argc char argv[]) using namespace stdchronoint j = (argc gt 1) atoi(argv[1]) 0int seed = (argc gt 2) atoi(argv[2]) 3size_t steps = (argc gt 3) atoll(argv[3]) 2000000float dt = (argc gt 4) (float)atof(argv[4]) 0001fsrand(seed)for (int i = 0 i lt LEN ++i)
a[i] = (float)rand() RAND_MAXb[i] = (float)rand() RAND_MAX
high_resolution_clocktime_point t0 =
high_resolution_clocknow()for (size_t i = 0 i lt steps i++)
step(dt)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span =
duration_castltdurationltdoublegtgt(t1 - t0)printf(time (milliseconds) lfn
10000 time_spancount())printf(a[i] = fn j a[j])printf(b[i] = fn j b[j])return EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFORE~23 WcbFull per 1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movups xmmword ptr [rsi+r84+4640h]xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movntps xmmword ptr [rsi+r84+4640h]xmm0
AVOID FALSE SHARING
SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that reside in the same cache line. This can reduce performance because of the work the processors must do to maintain cache coherency.
• L3 is filled from L2 victims of all 4 cores within a CCX.
• L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and EPYC™.
• Use thread-local rather than global or process-shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0: Assess Performance (Extended) Event Based Sampling Profile
  • Data Cache Refills CCX PTI
    • Minimize
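The "align and pad" advice above can be sketched portably as follows. This is only an illustration under stated assumptions: it uses `std::thread` instead of the Win32 `CreateThread` call used in the talk's own sample, it assumes a 64-byte cache line, and it assumes C++17 or later so that `std::vector` honors the over-alignment. The names `PaddedSum` and `sum_in_parallel` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, matching the alignas(64) in the talk's sample.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread slot. alignas rounds sizeof(PaddedSum) up to a full
// line, so neighbouring array elements never share a cache line.
struct alignas(kCacheLine) PaddedSum {
    std::uint64_t sum = 0;
};
static_assert(sizeof(PaddedSum) == kCacheLine, "each slot should fill one line");

// Each worker accumulates into its own padded slot; returns the grand total.
inline std::uint64_t sum_in_parallel(unsigned num_threads, std::uint64_t iters) {
    assert(num_threads > 0);
    std::vector<PaddedSum> slots(num_threads);
    std::vector<std::thread> workers;
    workers.reserve(num_threads);
    for (unsigned t = 0; t < num_threads; ++t) {
        workers.emplace_back([&slots, t, iters] {
            for (std::uint64_t i = 0; i < iters; ++i) {
                slots[t].sum += i & 1u;  // writes stay inside this thread's line
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    std::uint64_t total = 0;
    for (const auto& s : slots) {
        total += s.sum;
    }
    return total;
}
```

Removing the `alignas` would pack the slots back to 8 bytes each, so up to eight threads would write to one line. The results would be identical; only the coherency traffic would differ.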
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell February 5, 2019 on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.

Chart: Avoid False Sharing, Time (s) by binary
            Before   After
Time (s)    365      47
+679%
CODE SAMPLE
#include <windows.h>
#include <chrono>
#include <numeric>
#include <thread>
using namespace std::chrono;
#define NUM_ITER 2000000000
int seed;

#if 0 // 4 bytes
struct ThreadData { unsigned long sum; };
#else // 64 bytes
struct alignas(64) ThreadData { unsigned long sum; };
#endif

DWORD WINAPI ThreadProcCallback(void* param) {
    ThreadData* p = (ThreadData*)param;
    srand(seed);
    p->sum = 0;
    for (int i = 0; i < NUM_ITER; i++) {
        p->sum += rand() % 2;
    }
    return 0;
}

int main(int argc, char* argv[]) {
    seed = (argc > 1) ? atoi(argv[1]) : 3;
    int numThreads = std::thread::hardware_concurrency();
    HANDLE* threads = new HANDLE[numThreads];
    ThreadData* a = new ThreadData[numThreads];
    high_resolution_clock::time_point t0 = high_resolution_clock::now();
    for (size_t i = 0; i < numThreads; ++i) {
        threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
            (void*)&a[i], 0, NULL);
    }
    WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
    high_resolution_clock::time_point t1 = high_resolution_clock::now();
    duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
    printf("time (ms): %lf\n", 1000.0 * time_span.count());
    for (size_t i = 0; i < numThreads; ++i) {
        printf("sum[%llu] = %lu\n", i, a[i].sum);
    }
    delete[] a;
    delete[] threads;
    return EXIT_SUCCESS;
}
PROFILING BEFORE
Many refills from CCX
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
    mov  qword ptr [rsp+28h],rsi
    lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9,rbp
    mov  dword ptr [rsp+20h],esi
    xor  edx,edx
    xor  ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r15+rdi*8],rax
    add  rbp,40h
    inc  rdi
    cmp  rdi,r14
    jb   0000000140001180
BEFORE (BAD)
0000000140001180
    mov  qword ptr [rsp+28h],rsi
    lea  r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9,rbp
    mov  dword ptr [rsp+20h],esi
    xor  edx,edx
    xor  ecx,ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r12+rdi*8],rax
    add  rbp,4
    inc  rdi
    cmp  rdi,r14
    jb   0000000140001180
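In both listings rbp is copied into r9, the register that carries the thread parameter to CreateThread, and is then advanced by the element stride: 4 bytes in the bad build, 40h (64) bytes in the good one. The sketch below checks that arithmetic at compile time. It assumes a 64-byte cache line and uses `std::uint32_t` to stand in for the 4-byte `unsigned long` of the Windows sample; `PackedData`, `AlignedData` and `elements_per_line` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

// 4-byte element: consecutive array entries are 4 bytes apart.
struct PackedData { std::uint32_t sum; };

// Same payload, over-aligned: consecutive entries are 64 bytes (40h) apart.
struct alignas(64) AlignedData { std::uint32_t sum; };

// Number of consecutive array elements that can land on one cache line.
template <typename T>
constexpr std::size_t elements_per_line() {
    return sizeof(T) >= kCacheLine ? 1 : kCacheLine / sizeof(T);
}

static_assert(sizeof(PackedData) == 4, "stride matches 'add rbp,4'");
static_assert(sizeof(AlignedData) == 0x40, "stride matches 'add rbp,40h'");
static_assert(elements_per_line<PackedData>() == 16, "16 threads' data per line");
static_assert(elements_per_line<AlignedData>() == 1, "one thread's data per line");
```

With the packed layout, the data for up to 16 threads sits on a single line, so every increment by one thread can invalidate that line in the other cores' caches.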
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo, and combinations thereof, are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTER~3 WcbFull per
1000 instructions
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
67
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
68
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
69
AFTER (WITH WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movups xmmword ptr [rsi+r84+4640h]xmm0
BEFORE (WITHOUT WORKAROUND)
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+42E40h]
movntps xmmword ptr [rsi+r84+42E40h]xmm0
movups xmm0xmmword ptr [rsi+rcx4+4640h]
mulps xmm0xmm1
addps xmm0xmmword ptr [rsi+r84+4640h]
movntps xmmword ptr [rsi+r84+4640h]xmm0
AVOID FALSE SHARING
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull False Sharing occurs when threads running on different processors each with a local cache modify
variables that exist in the same cache line This can reduce performance due to processor work required to maintain cache-coherencybull L3 is filled from L2 victims of all 4 cores within a CCX
bull L2 tags are duplicated in the L3 for fast cache transfers within a CCXbull And fast probe filtering for RyzentradeThreadrippertrade and Epyctrade
bull Use thread local rather than global or process shared databull Align and pad Thread parameters ndash especially synchronizations variablesbull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull Data Cache Refills CCX PTI
bull Minimize
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1593bull Testing done by Kenneth Mitchell February 5
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
72
Before AfterTime (s) 365 47
+6790
50
100
150
200
250
300
350
400
Tim
e (s
)
binary
Avoid False Sharing
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
CODE SAMPLEinclude ltwindowshgtinclude ltchronogtinclude ltnumericgtinclude ltthreadgtusing namespace stdchronodefine NUM_ITER 2000000000int seed
if 0 4 bytes struct ThreadData unsigned long sum else 64 bytes struct alignas(64) ThreadData unsigned long sum endifDWORD WINAPI ThreadProcCallback(void param) ThreadData p = (ThreadData)paramsrand(seed)p-gtsum = 0for (int i = 0 i lt NUM_ITER i++)
p-gtsum += rand() 2return 0
73
int main(int argc char argv[]) seed = (argc gt 1) atoi(argv[1]) 3int numThreads = stdthreadhardware_concurrency()HANDLE threads = new HANDLE[numThreads]ThreadData a = new ThreadData[numThreads]high_resolution_clocktime_point t0 = high_resolution_clocknow()for (size_t i = 0 i lt numThreads ++i) threads[i] = CreateThread(NULL 0 ThreadProcCallback
(void)ampa[i] 0 NULL)WaitForMultipleObjects(numThreads threads TRUE INFINITE)high_resolution_clocktime_point t1 =
high_resolution_clocknow()durationltdoublegt time_span = duration_castltdurationltdoublegtgt(t1 - t0)
printf(time (ms) lfn 10000 time_spancount())for (size_t i = 0 i lt numThreads ++i)
printf(sum[llu] = lun i a[i]sum)delete[] adelete[] threadsreturn EXIT_SUCCESS
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING BEFOREMany refills from
CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PROFILING AFTERFew refills from CCX
CPU clocks
Ret inst IPC Data Cache Accesses PTI
Demand Data Cache Miss
Data Cache Miss
Data Cache Refills DRAM PTI
Data Cache Refills CCX PTI
Data Cache Refills L2 PTI
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE BEFORE
76
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SOURCE AFTER
77
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
DISASSEMBLY SNIPPET
78
AFTER (GOOD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi8]rax
add rbp40h
inc rdi
cmp rdir14
jb 0000000140001180
BEFORE (BAD)
0000000140001180
mov qword ptr [rsp+28h]rsi
lea r8[ThreadProcCallbackYAKPEAXZ]
mov r9rbp
mov dword ptr [rsp+20h]esi
xor edxedx
xor ecxecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi8]rax
add rbp4
inc rdi
cmp rdir14
jb 0000000140001180
ROADMAP
79
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201980
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
bull New Zen2 core microarchitecturebull New 7nm process technologybull Higher instructions-per-cycle (IPC)bull Doubled floating point width to 256-bitbull Doubled loadstore bandwidthbull PCIereg Gen4 Readybull AM4 Desktop Infrastructurebull See httpswwwamdcomeneventscesbull See httpswwwamdcomeneventsnext-horizon
3RD GENERATION AMD RYZENtrade DESKTOP PROCESSOR
QUESTIONS
82
GIVEAWAY
THANK YOU
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
KENNETH MITCHELLISV GAME ENGINEERING RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
KennethMitchellamdcomkenmitchellken
85
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 201986
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions, and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS, OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL, OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD, Ryzen™, and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (WITH WORKAROUND)
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h], xmm0
    movups  xmm0, xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+4640h]
    movups  xmmword ptr [rsi+r8*4+4640h], xmm0
BEFORE (WITHOUT WORKAROUND)
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+42E40h]
    movntps xmmword ptr [rsi+r8*4+42E40h], xmm0
    movups  xmm0, xmmword ptr [rsi+rcx*4+4640h]
    mulps   xmm0, xmm1
    addps   xmm0, xmmword ptr [rsi+r8*4+4640h]
    movntps xmmword ptr [rsi+r8*4+4640h], xmm0
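The two listings differ only in the final store: the "before" code writes the second result with movntps (a non-temporal store), while the "after" code uses movups (an ordinary unaligned store). The same multiply, add, and store pattern can be sketched with SSE intrinsics, where `_mm_stream_ps` typically compiles to movntps and `_mm_storeu_ps` to movups. This is a minimal sketch only: `scale_add` is a hypothetical helper and does not appear in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics

// Hypothetical helper, not from the slides: dst[i] = src[i] * k + dst[i].
// Assumes n is a multiple of 4 and, when nonTemporal is true, that dst is
// 16-byte aligned (a requirement of _mm_stream_ps).
void scale_add(float* dst, const float* src, float k, std::size_t n, bool nonTemporal) {
    assert(n % 4 == 0);
    assert(!nonTemporal || reinterpret_cast<std::uintptr_t>(dst) % 16 == 0);
    const __m128 scale = _mm_set1_ps(k);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(src + i);          // unaligned load
        v = _mm_mul_ps(v, scale);                  // mulps
        v = _mm_add_ps(v, _mm_loadu_ps(dst + i));  // addps
        if (nonTemporal)
            _mm_stream_ps(dst + i, v);             // non-temporal store (movntps)
        else
            _mm_storeu_ps(dst + i, v);             // regular store (movups)
    }
    if (nonTemporal)
        _mm_sfence();  // order the streaming stores before any later stores
}
```

Both paths produce the same values in `dst`; they differ only in how the store interacts with the cache hierarchy.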
AVOID FALSE SHARING
SUMMARY
• False sharing occurs when threads running on different processors, each with a local cache, modify variables that reside in the same cache line. This can reduce performance because of the work the processors must do to maintain cache coherency.
• L3 is filled from L2 victims of all 4 cores within a CCX.
  • L2 tags are duplicated in the L3 for fast cache transfers within a CCX.
  • And fast probe filtering for Ryzen™ Threadripper™ and Epyc™.
• Use thread-local rather than global or process-shared data.
• Align and pad thread parameters, especially synchronization variables.
• Profiling
  • AMD uProf v2.0 "Assess Performance (Extended)" Event Based Sampling profile
  • Data Cache Refills CCX PTI
    • Minimize
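The advice to prefer thread-local data can be sketched as follows: each worker accumulates into a local variable and writes to the shared array only once, so neighbouring slots are not modified repeatedly by different cores. This is a minimal sketch using std::thread rather than the Win32 API; `sum_with_local_accumulators` and the toy workload are assumptions for illustration and are not from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not from the slides: every worker sums into a local
// variable and stores its result into the shared vector exactly once.
std::vector<unsigned long> sum_with_local_accumulators(unsigned numThreads,
                                                       unsigned long itersPerThread) {
    assert(numThreads > 0);
    std::vector<unsigned long> results(numThreads, 0);
    std::vector<std::thread> workers;
    workers.reserve(numThreads);
    for (unsigned t = 0; t < numThreads; ++t) {
        workers.emplace_back([&results, t, itersPerThread] {
            unsigned long local = 0;  // private to this thread
            for (unsigned long i = 0; i < itersPerThread; ++i)
                local += (i + t) % 2;  // toy workload standing in for real work
            results[t] = local;        // single write to the shared array
        });
    }
    for (auto& w : workers)
        w.join();
    return results;
}
```

The shared `results` slots may still sit in one cache line, but each is written only once per thread, so the line is not bounced between cores inside the hot loop.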
PERFORMANCE
• Binary compiled after applying best practices shows higher performance.
• Performance of binary compiled with Microsoft Visual Studio 2017 v15.9.3.
• Testing done by Kenneth Mitchell, February 5, 2019, on the following system. PC manufacturers may vary configurations, yielding different results. Results may vary based on driver versions used. Test configuration: AMD Ryzen™ 7 2700X Processor, 2x8GB DDR4-3200 (16-18-18-36), Radeon™ RX Vega 64, AMD Ryzen™ Reference Motherboard, Windows® 10 x64 build 1809, 1920x1080 resolution, BIOS core clock = 3.7GHz, Core Performance Boost = Disable, Global C-state Control = Disable.
Chart: Avoid False Sharing, Time (s) by binary
  Before: 365
  After: 47 (+679%)
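The relationship between the two elapsed times and a percentage gain can be checked with a one-line calculation. This is a minimal sketch: `percent_gain` is a hypothetical helper, and it assumes the gain is defined as the ratio of the times minus one.

```cpp
#include <cassert>

// Hypothetical helper, not from the slides: percentage gain in throughput
// when the same work drops from beforeSeconds to afterSeconds.
double percent_gain(double beforeSeconds, double afterSeconds) {
    assert(afterSeconds > 0.0);
    return (beforeSeconds / afterSeconds - 1.0) * 100.0;
}
```

With the rounded chart values, `percent_gain(365.0, 47.0)` evaluates to roughly 677, close to the gain labelled on the chart, which was presumably computed from unrounded timings.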
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <cstdio>
    #include <cstdlib>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0
    // 4 bytes: adjacent elements share a cache line (false sharing)
    struct ThreadData { unsigned long sum; };
    #else
    // 64 bytes: each element occupies its own cache line
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        // One thread per hardware thread, each updating its own array element
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                (void*)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 =
            high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX.
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX.
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
    mov  qword ptr [rsp+28h], rsi
    lea  r8, [?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9, rbp
    mov  dword ptr [rsp+20h], esi
    xor  edx, edx
    xor  ecx, ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r15+rdi*8], rax
    add  rbp, 40h
    inc  rdi
    cmp  rdi, r14
    jb   0000000140001180
BEFORE (BAD)
0000000140001180
    mov  qword ptr [rsp+28h], rsi
    lea  r8, [?ThreadProcCallback@@YAKPEAX@Z]
    mov  r9, rbp
    mov  dword ptr [rsp+20h], esi
    xor  edx, edx
    xor  ecx, ecx
    call qword ptr [__imp_CreateThread]
    mov  qword ptr [r12+rdi*8], rax
    add  rbp, 4
    inc  rdi
    cmp  rdi, r14
    jb   0000000140001180
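The only difference that matters between the two loops is the stride added to rbp, the pointer handed to each new thread: 4 bytes in the "before" build and 40h (64) bytes in the "after" build. What that means for cache-line placement can be sketched as below, assuming 64-byte cache lines and an array that starts on a line boundary. `cache_line_of` and `distinct_lines` are hypothetical helpers, not from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

constexpr std::size_t kCacheLine = 64;  // assumption: 64-byte cache lines

// Index of the cache line holding element i of an array with the given
// element stride, assuming the array begins on a cache-line boundary.
std::size_t cache_line_of(std::size_t i, std::size_t strideBytes) {
    assert(strideBytes > 0);
    return (i * strideBytes) / kCacheLine;
}

// Number of distinct cache lines touched by `count` per-thread elements.
std::size_t distinct_lines(std::size_t count, std::size_t strideBytes) {
    std::set<std::size_t> lines;
    for (std::size_t i = 0; i < count; ++i)
        lines.insert(cache_line_of(i, strideBytes));
    return lines.size();
}
```

For 16 threads, `distinct_lines(16, 4)` is 1, so every thread writes to the same line, while `distinct_lines(16, 64)` is 16, giving each thread a line of its own.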
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
SUMMARYbull False Sharing occurs when threads running on different processors each with a local cache modify
variables that exist in the same cache line This can reduce performance due to processor work required to maintain cache-coherencybull L3 is filled from L2 victims of all 4 cores within a CCX
bull L2 tags are duplicated in the L3 for fast cache transfers within a CCXbull And fast probe filtering for RyzentradeThreadrippertrade and Epyctrade
bull Use thread local rather than global or process shared databull Align and pad Thread parameters ndash especially synchronizations variablesbull Profiling
bull AMD uProf v20 Assess Performance (Extended) Event Based Sampling Profilebull Data Cache Refills CCX PTI
bull Minimize
| GDC19 | AMD Ryzentrade Processor Software Optimization | March 20th 2019
PERFORMANCEbull Binary compiled after applying best practices
shows higher performancebull Performance of binary compiled with Microsoft
Visual Studio 2017 v1593bull Testing done by Kenneth Mitchell February 5
2019 on the following system PC manufacturers may vary configurations yielding different results Results may vary based on driver versions used Test configuration AMD Ryzentrade 7 2700X Processor 2x8GB DDR4-3200 (16-18-18-36) Radeontrade RX Vega 64 AMD Ryzentrade Reference Motherboard Windowsreg 10 x64 build 1809 1920x1080 resolution BIOS core clock = 37GHz Core Performance Boost = Disable Global C-state Control = Disable
72
Before AfterTime (s) 365 47
+6790
50
100
150
200
250
300
350
400
Tim
e (s
)
binary
Avoid False Sharing
CODE SAMPLE
    #include <windows.h>
    #include <chrono>
    #include <numeric>
    #include <thread>
    using namespace std::chrono;
    #define NUM_ITER 2000000000
    int seed;

    #if 0
    // 4 bytes
    struct ThreadData { unsigned long sum; };
    #else
    // 64 bytes
    struct alignas(64) ThreadData { unsigned long sum; };
    #endif

    DWORD WINAPI ThreadProcCallback(void* param) {
        ThreadData* p = (ThreadData*)param;
        srand(seed);
        p->sum = 0;
        for (int i = 0; i < NUM_ITER; i++)
            p->sum += rand() % 2;
        return 0;
    }

    int main(int argc, char* argv[]) {
        seed = (argc > 1) ? atoi(argv[1]) : 3;
        int numThreads = std::thread::hardware_concurrency();
        HANDLE* threads = new HANDLE[numThreads];
        ThreadData* a = new ThreadData[numThreads];
        high_resolution_clock::time_point t0 = high_resolution_clock::now();
        for (size_t i = 0; i < numThreads; ++i)
            threads[i] = CreateThread(NULL, 0, ThreadProcCallback,
                (void*)&a[i], 0, NULL);
        WaitForMultipleObjects(numThreads, threads, TRUE, INFINITE);
        high_resolution_clock::time_point t1 = high_resolution_clock::now();
        duration<double> time_span = duration_cast<duration<double>>(t1 - t0);
        printf("time (ms): %lf\n", 1000.0 * time_span.count());
        for (size_t i = 0; i < numThreads; ++i)
            printf("sum[%llu] = %lu\n", i, a[i].sum);
        delete[] a;
        delete[] threads;
        return EXIT_SUCCESS;
    }
PROFILING BEFORE
Many refills from CCX.
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
PROFILING AFTER
Few refills from CCX.
Profile columns: CPU clocks, Ret inst, IPC, Data Cache Accesses PTI, Demand Data Cache Miss, Data Cache Miss, Data Cache Refills DRAM PTI, Data Cache Refills CCX PTI, Data Cache Refills L2 PTI
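The profile columns are normalized counters, which makes runs of different length comparable. As a rough sketch, assuming "PTI" stands for "per thousand instructions" (events divided by retired instructions, times 1000) and that IPC is retired instructions divided by CPU clocks, the two ratios could be computed as follows. The function names are hypothetical and are not part of any profiler API.

```cpp
#include <cstdint>

// Events per thousand retired instructions (assumed meaning of "PTI").
// Returns 0 when no instructions were retired to avoid dividing by zero.
constexpr double perThousandInstructions(std::uint64_t eventCount,
                                         std::uint64_t retiredInstructions) {
    return retiredInstructions == 0
               ? 0.0
               : 1000.0 * static_cast<double>(eventCount) /
                     static_cast<double>(retiredInstructions);
}

// Instructions per cycle: retired instructions divided by CPU clocks.
constexpr double instructionsPerCycle(std::uint64_t retiredInstructions,
                                      std::uint64_t cpuClocks) {
    return cpuClocks == 0 ? 0.0
                          : static_cast<double>(retiredInstructions) /
                                static_cast<double>(cpuClocks);
}
```

With made-up counts, 50 refill events over 1000 retired instructions give `perThousandInstructions(50, 1000)` equal to 50.0, and 3000 instructions over 2000 clocks give an IPC of 1.5. In the false sharing case, the value to watch is the CCX refill rate, which should drop after padding.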
SOURCE BEFORE
SOURCE AFTER
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r15+rdi*8],rax
    add   rbp,40h
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
BEFORE (BAD)
0000000140001180
    mov   qword ptr [rsp+28h],rsi
    lea   r8,[?ThreadProcCallback@@YAKPEAX@Z]
    mov   r9,rbp
    mov   dword ptr [rsp+20h],esi
    xor   edx,edx
    xor   ecx,ecx
    call  qword ptr [__imp_CreateThread]
    mov   qword ptr [r12+rdi*8],rax
    add   rbp,4
    inc   rdi
    cmp   rdi,r14
    jb    0000000140001180
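The two loops differ only in the constant added to `rbp`, the pointer to the current thread's data: 40h (64) bytes after the fix, 4 bytes before. That constant is the array stride, `sizeof(ThreadData)`. The sketch below checks this at compile time. It is illustrative only: `PackedData`, `PaddedData`, `cacheLineOfElement`, and `elementsPerCacheLine` are hypothetical names, a 64-byte cache line is assumed to match the sample's `alignas(64)`, and `std::uint32_t` stands in for the 4-byte `unsigned long` of MSVC.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, matching the alignas(64) used in the sample.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical stand-ins for the two ThreadData variants. std::uint32_t is used
// because unsigned long is 4 bytes with MSVC but 8 bytes on most 64-bit Unix ABIs.
struct PackedData { std::uint32_t sum; };
struct alignas(kCacheLineSize) PaddedData { std::uint32_t sum; };

// The array stride is sizeof(T): the constant added to rbp on each iteration.
static_assert(sizeof(PackedData) == 4, "stride of the BEFORE loop (add rbp,4)");
static_assert(sizeof(PaddedData) == 0x40, "stride of the AFTER loop (add rbp,40h)");

// Index of the cache line holding element i, relative to a line-aligned array base.
template <typename T>
constexpr std::size_t cacheLineOfElement(std::size_t i) {
    return (i * sizeof(T)) / kCacheLineSize;
}

// Number of consecutive elements whose data can land in the same cache line.
template <typename T>
constexpr std::size_t elementsPerCacheLine() {
    return sizeof(T) >= kCacheLineSize ? 1 : kCacheLineSize / sizeof(T);
}
```

With the packed layout, up to 16 adjacent per-thread counters can fall into one cache line, so every increment by one thread invalidates the line for its neighbours. With the padded layout each counter occupies its own line, at the cost of 60 unused bytes per thread.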
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change.
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU
KENNETH MITCHELL
ISV GAME ENGINEERING, RADEON TECHNOLOGY GROUP
CPU DEV TECH TEAM LEAD
Kenneth.Mitchell@amd.com
@kenmitchellken
DISCLAIMER AND ATTRIBUTION
The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.
AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.
AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
ATTRIBUTION
© 2019 Advanced Micro Devices, Inc. All rights reserved. AMD Ryzen™ and the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Vulkan and the Vulkan logo are registered trademarks of Khronos Group Inc. Microsoft and Windows are registered trademarks of Microsoft Corporation. PCIe is a registered trademark of PCI-SIG. Other names are for informational purposes only and may be trademarks of their respective owners.
DISASSEMBLY SNIPPET
AFTER (GOOD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp                       ; lpParameter: this thread's data
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r15+rdi*8],rax
add rbp,40h                      ; next thread's data starts 64 bytes (one cache line) later
inc rdi
cmp rdi,r14
jb 0000000140001180
BEFORE (BAD)
0000000140001180:
mov qword ptr [rsp+28h],rsi
lea r8,[?ThreadProcCallback@@YAKPEAX@Z]
mov r9,rbp                       ; lpParameter: this thread's data
mov dword ptr [rsp+20h],esi
xor edx,edx
xor ecx,ecx
call qword ptr [__imp_CreateThread]
mov qword ptr [r12+rdi*8],rax
add rbp,4                        ; next thread's data starts only 4 bytes later (same cache line)
inc rdi
cmp rdi,r14
jb 0000000140001180
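The two listings differ mainly in the stride applied to the pointer passed as each thread's parameter: 4 bytes in the BEFORE loop, 40h (64) bytes in the AFTER loop. A minimal C++ sketch of source code that could produce those two strides follows. It is an illustration under assumptions, not the code from the talk: the names PackedCounter, PaddedCounter, and run_counters are hypothetical, std::thread stands in for CreateThread to keep the sketch portable, a 64-byte cache line is assumed, and C++17 is assumed so that std::vector honors the over-aligned type.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <thread>
#include <vector>

// Assumption: 64-byte cache lines, matching the 40h stride in the AFTER listing.
constexpr std::size_t kCacheLine = 64;

// BEFORE-style layout: consecutive 4-byte counters sit in the same cache line.
struct PackedCounter {
    std::uint32_t value = 0;
};

// AFTER-style layout: alignas pads each counter out to its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::uint32_t value = 0;
};

static_assert(sizeof(PackedCounter) == 4, "4-byte stride, as in BEFORE");
static_assert(sizeof(PaddedCounter) == kCacheLine, "64-byte stride, as in AFTER");

// Starts one thread per counter. Each thread increments only its own element,
// so the program is race-free with either layout; only the cache-line sharing
// between neighbouring elements differs.
// Returns the sum of all counters once every thread has joined.
template <typename Counter>
std::uint64_t run_counters(std::size_t thread_count, std::uint32_t iterations) {
    assert(thread_count > 0);
    std::vector<Counter> counters(thread_count);
    std::vector<std::thread> threads;
    threads.reserve(thread_count);
    for (std::size_t i = 0; i < thread_count; ++i) {
        // Volatile access keeps a real load and store in every iteration,
        // so the cache line is written repeatedly instead of once at the end.
        volatile std::uint32_t* slot = &counters[i].value;
        threads.emplace_back([slot, iterations] {
            for (std::uint32_t n = 0; n < iterations; ++n) {
                *slot = *slot + 1;
            }
        });
    }
    for (std::thread& t : threads) {
        t.join();
    }
    std::uint64_t total = 0;
    for (const Counter& c : counters) {
        total += c.value;
    }
    return total;
}
```

Both run_counters<PackedCounter>(n, k) and run_counters<PaddedCounter>(n, k) return n * k. Timing the two calls with a large k on a multi-core machine is one way to observe whether the packed layout suffers from cache-line contention on a given system.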
ROADMAP
CONTINUOUS INNOVATION LEADERSHIP
Roadmap subject to change
3RD GENERATION AMD RYZEN™ DESKTOP PROCESSOR
• New Zen 2 core microarchitecture
• New 7nm process technology
• Higher instructions-per-cycle (IPC)
• Doubled floating point width to 256-bit
• Doubled load/store bandwidth
• PCIe® Gen4 Ready
• AM4 Desktop Infrastructure
• See https://www.amd.com/en/events/ces
• See https://www.amd.com/en/events/next-horizon
QUESTIONS
GIVEAWAY
THANK YOU