FUSIONSIM: A Cycle-Accurate CPU + GPU System Simulator
Vitaly Zakharenko, Andreas Moshovos
University of Toronto
Tor Aamodt
University of British Columbia
With support from AMD Canada, Ontario Centres of Excellence, and the Natural Sciences and Engineering Research Council of Canada.
3 | FusionSim: A Cycle Accurate CPU + GPU Simulator | June 13th, 2012
WHAT IS FUSIONSIM?
Detailed timing simulator of a complete system with an x86 CPU and a GPU
– Fused or Discrete Systems
FusionSim’s features:
– x86 out-of-order CPU + CUDA-capable GPU
Operate concurrently
– Detailed timing models for all components
Models reflect modern hardware
Enables performance modeling:
Fused vs. Discrete
“What if” scenarios
AGENDA
TWO FLAVOURS OF FUSIONSIM
Structure & Functionality of Discrete FusionSim
– Models a discrete system:
Distinct CPU and GPU chips
Separate CPU and GPU DRAM
Structure & Functionality of Fused FusionSim
– Models a fused system:
Same CPU and GPU chip
Shared CPU and GPU DRAM
Partly shared memory hierarchy
AGENDA
FUSION: WHICH BENCHMARK BENEFITS?
Analytical speed-up model
– Greater speed-up for:
Small benchmark input data size
Many kernel invocations (large cumulative latency overhead)
High benchmark kernel throughput
Long time spent in the GPU code relative to the x86 code
Simulation speed-up results on Rodinia
– Range: 1.05× to 9.72×
– A closer look at a fusion-friendly benchmark
Large speed-up (up to 9.72×) for small problem sizes
Smaller speed-up (1.8×) for medium problem sizes
– Dependence on latency overhead and kernel throughput
G_GPU = 1 + (Λ_TOTAL / data_TOTAL) · θ_KERNEL
      = 1 + 2·θ_KERNEL/θ_COPY + n_KERNEL·τ_KS·θ_KERNEL/data_TOTAL
AGENDA
FUSION: WHICH SYSTEM FACTORS AFFECT SPEED-UP?
Kernel spawn latency
– From the GPU API kernel-launch request until actual kernel execution begins
– Simulation: an order-of-magnitude reduction is important
CPU/GPU memory coherence
– Simulation: the performance loss is minor
Less than 2% for most Rodinia benchmarks
DISCRETE FUSIONSIM:
STRUCTURE
CPU from PTLSim: www.ptlsim.org
GPU from GPGPU-Sim: www.gpgpu-sim.org
CPU caches of MARSSx86: www.marss86.org
DISCRETE FUSIONSIM:
COMPONENT FEATURES
CPU: PTLSIM
– Fast x86 simulation speed: ~200 KIPS (isolated)
– Out-of-Order
– Micro-op architecture
– Cycle-accurate
– Modular & detailed memory
hierarchy model
GPU: GPGPU-SIM
– OpenCL/CUDA capable
Currently only CUDA
– High correlation vs. Nvidia GT200 and Fermi
NoC
– Detailed & configurable
DRAM
– Detailed
DISCRETE FUSIONSIM:
START-UP AND MEMORY LAYOUT
Input: standard Linux CUDA benchmark executable
The benchmark process is created
Simulator is injected into virtual memory space
– Private stack
– Private heap & heap management
– Invisible to the benchmark process
Simulator executes benchmark’s code:
– x86 code on PTLsim
– PTX code on GPGPU-Sim
The benchmark process communicates with FusionSim via a single page accessible by both, exposed through a replacement of the standard CUDA dynamic library
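The single-shared-page handshake can be sketched as a tiny mailbox (illustrative Python; the field layout is hypothetical, the real page format is FusionSim-internal):

```python
import struct

PAGE_SIZE = 4096
page = bytearray(PAGE_SIZE)  # the one page mapped into both address views

def benchmark_write_call(call_id, arg):
    """Benchmark side: publish an API request in the shared page."""
    struct.pack_into("<II", page, 0, call_id, arg)

def simulator_read_call():
    """Simulator side: pick up the pending request on its next cycle."""
    return struct.unpack_from("<II", page, 0)

benchmark_write_call(call_id=7, arg=1024)
pending = simulator_read_call()
```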
DISCRETE FUSIONSIM:
MAIN SIMULATION LOOP
Single simulation loop:
– Each loop cycle == tick of a virtual ‘common’ clock
– common clock frequency × GPU_MULTIPLIER = GPU_FREQ
– common clock frequency × CPU_MULTIPLIER = CPU_FREQ
WHILE (1) {
FOR GPU_MULTIPLIER ITERATIONS DO {
GPU_CYCLE()
}
FOR CPU_MULTIPLIER ITERATIONS DO {
CPU_CYCLE()
}
}
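The loop above can be written as runnable Python (a minimal illustration of the clock-ratio scheme, not FusionSim's actual implementation; the component callbacks are placeholders):

```python
def run(gpu_cycle, cpu_cycle, gpu_multiplier, cpu_multiplier, common_ticks):
    """Drive GPU and CPU models from one virtual 'common' clock.

    Each common tick advances the GPU gpu_multiplier cycles and the CPU
    cpu_multiplier cycles, keeping the component frequencies in the
    ratio gpu_multiplier : cpu_multiplier.
    """
    for _ in range(common_ticks):
        for _ in range(gpu_multiplier):
            gpu_cycle()
        for _ in range(cpu_multiplier):
            cpu_cycle()

# Example: a CPU clocked 3x faster than the GPU, common clock = GPU clock
counts = {"gpu": 0, "cpu": 0}
run(lambda: counts.__setitem__("gpu", counts["gpu"] + 1),
    lambda: counts.__setitem__("cpu", counts["cpu"] + 1),
    gpu_multiplier=1, cpu_multiplier=3, common_ticks=1000)
```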
DISCRETE FUSIONSIM:
EXAMPLE GPU API CALL
Virtual PTLsim CPU executes x86 code
– a call to the API function cudaMemcpyAsync(a, b, c) is reached
On next GPU cycle, FusionSim
– Identifies pending API call
– Enqueues the task for the GPU
– Decides whether to block the CPU (synchronous) or
to let the CPU proceed (asynchronous)
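This enqueue-and-decide step can be sketched as follows (illustrative Python; the call classification below is a simplified stand-in, not the exact CUDA blocking semantics):

```python
from collections import deque

SYNCHRONOUS = {"cudaMemcpy", "cudaDeviceSynchronize"}   # CPU must block
ASYNCHRONOUS = {"cudaMemcpyAsync", "kernel_launch"}     # CPU proceeds

class GpuTaskQueue:
    """Pending GPU work, drained one task per simulated GPU cycle."""
    def __init__(self):
        self.pending = deque()

    def dispatch(self, api_call):
        """Enqueue an API call for the GPU; return True if the CPU blocks."""
        self.pending.append(api_call)
        return api_call in SYNCHRONOUS

q = GpuTaskQueue()
cpu_blocked = q.dispatch("cudaMemcpyAsync")  # asynchronous: CPU keeps running
```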
DISCRETE FUSIONSIM:
SIMULATOR FEATURES
Correctly models ordering and overlap in time of
– asynchronous & synchronous operations
– memory transfers
– CUDA events
– Kernel computations
– CPU processing
Models duration of all CUDA stream operations
Simple and powerful mechanism for management of configuration and simulation output files
FUSED FUSIONSIM:
STRUCTURE
Processing Cluster is replaced by a CPU
CUDA ‘global’ memory address space is shared
– No more memory transfers to/from device DRAM
Last Level Cache size is adjusted (increased)
– GPU’s L2 is also CPU’s L3
CPU: L1 and L2 private caches
FUSED FUSIONSIM:
A CHALLENGE WITH EXISTING CPU + GPU MEMORY SPACES
CUDA ‘global’ memory space
– Shared between CPU & GPU
– Accessible by both using the same virtual address
– Cached in LLC and mapped to DRAM
CUDA ‘local’ memory space
– Private to the GPU
– Inaccessible by the CPU
– Cached in LLC and mapped to DRAM
How do we model these?
FUSED FUSIONSIM:
SIMULATING THE CPU AND GPU MEMORY SPACES
Common Virtual memory
– Used by both the CPU and the GPU
– Slightly different virtual memory spaces
‘Generic’ virtual address
– Used by the GPU
– For the same location X accessible by the CPU at virtual address virt_addr:
generic_virt_addr = virt_addr + 0x40000000
32-bit virtual address space (4 GBytes)
– FusionSim does not simulate OS kernel code, so the top-most 1 GByte of addresses is unused
FUSED FUSIONSIM:
MEMORY SPACE: WHERE AND WHAT
CPU
– Uses CPU virtual address
GPU
– Uses ‘generic’ virtual address
Caches
– Physically-addressed
– The CPU adjusts its virtual address to ‘generic’ and translates it to physical
– The GPU translates ‘generic’ directly to physical
MMU
– Same MMU for both the CPU and the GPU
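This address handling can be sketched in a few lines (the 0x40000000 offset is from the slides; the dictionary page table is a stand-in for the shared MMU):

```python
GENERIC_OFFSET = 0x40000000
PAGE_SIZE = 4096

def cpu_to_generic(cpu_virt_addr):
    """Adjust a CPU virtual address into the GPU's 'generic' space."""
    return cpu_virt_addr + GENERIC_OFFSET

def generic_to_physical(generic_addr, page_table):
    """Translate a generic virtual address to physical via the shared MMU."""
    vpn, offset = divmod(generic_addr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset

# The CPU (after adjustment) and the GPU reach the same physical byte:
page_table = {(0x40001000 + GENERIC_OFFSET) // PAGE_SIZE: 7}
phys = generic_to_physical(cpu_to_generic(0x40001000), page_table)
```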
FUSED FUSIONSIM
MEMORY COHERENCE
Shared CUDA ‘global’ address space
Same block from ‘global’ space
– Cached in private CPU L1 $
– Cached in private GPU L1 $
Potential coherence problem
First-cut solution: Flushing caches to LLC
– Interesting area for exploration
FUSED FUSIONSIM
MEMORY COHERENCE: IMPLEMENTATION
CPU side:
– Selective flushing of private caches
– cudaSelectivelyFlush(address, size) is called
prior to every kernel invocation
for every region of memory accessed by the kernel
GPU side:
– GPGPU-Sim already flushes the caches
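The CPU-side selective flush can be modeled minimally (illustrative Python; the 64-byte block size and the flat dirty-block map are assumptions, not FusionSim's cache model):

```python
BLOCK = 64  # cache block size in bytes (an assumption for illustration)

class PrivateCache:
    def __init__(self):
        self.dirty = {}  # block number -> data

    def cuda_selectively_flush(self, address, size, llc):
        """Write back and drop every dirty block overlapping [address, address+size)."""
        first = address // BLOCK
        last = (address + size - 1) // BLOCK
        for blk in range(first, last + 1):
            if blk in self.dirty:
                llc[blk] = self.dirty.pop(blk)  # write back to the shared LLC

# Flush only the region a kernel is about to read; unrelated blocks stay put
l1, llc = PrivateCache(), {}
l1.dirty = {100: b"a", 101: b"b", 300: b"c"}
l1.cuda_selectively_flush(address=100 * BLOCK, size=2 * BLOCK, llc=llc)
```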
FUSED FUSIONSIM
CHANGES TO GPU API
No need for device memory allocation API
– cudaMalloc()
– cudaFree()
No memory transfers to/from device DRAM
– cudaMemCpy()
– cudaMemset()
Additional API function:
– cudaSelectivelyFlush(address, size)
FUSED VS DISCRETE: EXPERIMENTAL METHODOLOGY
Rodinia
– benchmark suite for heterogeneous computing
Discrete system modeled by Discrete FusionSim
– Unmodified Rodinia
Fused system modeled by Fused FusionSim
– Modified Rodinia:
No cudaMalloc()/cudaFree()
No memory transfers
Added cudaSelectivelyFlush()
Data input generation is excluded from the time measurement
FUSED VS DISCRETE: RELATIVE PERFORMANCE
Rodinia benchmarks
Two baseline discrete systems:
– 10 µs kernel spawn latency
– 100 µs kernel spawn latency
Speed-up varies:
– from 1.05× (nn, 10 µs)
– up to 9.72× (gaus_4, 10 µs)
[Chart: per-benchmark speed-up of fused over discrete; FUSED is better]
FUSED SYSTEM: KERNEL SPAWN LATENCY
One baseline discrete system:
– 10 µs kernel spawn latency
Different fused systems:
– 0.1 µs kernel spawn latency
– 1 µs kernel spawn latency
– 10 µs kernel spawn latency
Simulations show:
– Reducing the latency to 1 µs is important
– Further reduction below 1 µs is NOT important
[Chart: per-benchmark speed-up for each fused spawn latency; FUSED is better]
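The diminishing returns follow from the per-iteration time τ_KS + data_KER/θ_KER: once the spawn latency is small next to the kernel time, shrinking it further barely moves the total. A worked example with made-up numbers (a 50 µs kernel is an assumption for illustration):

```python
KERNEL_TIME_US = 50.0  # hypothetical per-iteration kernel compute time

def iteration_time(spawn_latency_us):
    """Per-iteration time: spawn latency plus kernel execution time."""
    return spawn_latency_us + KERNEL_TIME_US

t10, t1, t01 = (iteration_time(l) for l in (10.0, 1.0, 0.1))
gain_10_to_1 = t10 / t1   # ~1.18: cutting 10 us -> 1 us is noticeable
gain_1_to_01 = t1 / t01   # ~1.02: cutting 1 us -> 0.1 us is negligible
```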
FUSED SYSTEM: COHERENCE OVERHEAD
Two fused systems:
– Incoherent vs. coherent
– Kernel spawn latency is 0.1 µs in both systems
Simulations show:
– Minor performance loss
Less than 2% for most benchmarks
5% for bfs_small
[Chart: coherence overhead per benchmark; SMALLER is better]
FUSION: WHICH BENCHMARK BENEFITS?
ANALYTICAL MODEL
Symbols (see the derivation slides):
– Λ_TOTAL: total cumulative latency
– θ_KERNEL: kernel throughput
– data_TOTAL: benchmark input data size
Greater speed-up for:
– Small benchmark input data size (small data_TOTAL)
– Many kernel invocations and memory transfers (large Λ_TOTAL)
– High benchmark kernel throughput (large θ_KERNEL)
– Long time spent in the GPU code relative to the CPU code
Analytical model:
G_GPU = 1 + (Λ_TOTAL / data_TOTAL) · θ_KERNEL
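The model G_GPU = 1 + (Λ_TOTAL/data_TOTAL)·θ_KERNEL can be evaluated directly (illustrative Python; the parameter values are made up, not Rodinia measurements):

```python
def gpu_speedup(total_latency_us, data_total_bytes, kernel_throughput_b_per_us):
    """G_GPU = 1 + (latency overhead / total data) * kernel throughput."""
    return 1.0 + (total_latency_us / data_total_bytes) * kernel_throughput_b_per_us

# Small input + many spawns (large normalized latency) => big win from fusion
g_small = gpu_speedup(total_latency_us=1000.0, data_total_bytes=1_000_000,
                      kernel_throughput_b_per_us=1000.0)   # 2.0
# Same latency amortized over 100x more data => little win
g_large = gpu_speedup(total_latency_us=1000.0, data_total_bytes=100_000_000,
                      kernel_throughput_b_per_us=1000.0)   # 1.01
```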
FUSION: WHICH BENCHMARK BENEFITS?
TWO SCENARIOS
Per-chunk form of the model, with the throughput written as a function of the chunk size:
G_GPU = 1 + 2·θ_KERNEL(data_CHUNK)/θ_COPY + τ_KS·θ_KERNEL(data_CHUNK)/data_CHUNK
Scenario 1: large data_CHUNK
– θ_KERNEL(data_CHUNK) is large (saturated)
– The memory-copy term is significant
– The kernel-spawn term is insignificant
Scenario 2: small data_CHUNK
– θ_KERNEL(data_CHUNK) is small
– The memory-copy term is insignificant
– The kernel-spawn term is significant
FUSION: WHICH BENCHMARK BENEFITS?
INPUT DATA SIZE
Greater problem size => smaller benefit from fusion
[Chart: fused-vs-discrete speed-up for Rodinia BFS and Rodinia Gaussian at increasing input sizes; FUSED is better, and the benefit shrinks as input size grows]
FUSION: WHICH BENCHMARK BENEFITS?
LATENCY OVERHEAD
Comparison between two benchmarks:
– Rodinia Gaussian
Speed-up 9.72x
– Rodinia NN
Speed-up 1.05x
– Why?
100 times more kernel spawns for Gaussian
10 times more memory copies for Gaussian
Gaussian therefore has a much larger normalized latency overhead Λ_TOTAL/data_TOTAL; by the model G_GPU = 1 + (Λ_TOTAL/data_TOTAL)·θ_KERNEL, this yields the much larger speed-up.
FUSION: WHICH BENCHMARK BENEFITS?
KERNEL THROUGHPUT
Comparison between two benchmarks:
– Rodinia BFS
Speed-up 4.28x
– Rodinia NN
Speed-up 1.05x
– Why?
100 times greater kernel throughput θ_KERNEL for BFS
By the model G_GPU = 1 + (Λ_TOTAL/data_TOTAL)·θ_KERNEL, the much larger θ_KERNEL amplifies the latency-overhead term, yielding the larger speed-up.
FUSIONSIM WEBSITE:
DOCUMENTATION & SOURCE CODE
www.fusionsim.ca
– Discrete FusionSim & Fused FusionSim
– Source code
– Documentation
– Google group for collaborators
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
SYMBOL MEANINGS
– n_KER: number of kernel invocations (computation iterations)
– data_KER: input data size per kernel invocation; data_TOTAL = n_KER · data_KER
– θ_KER (θ_KERNEL): kernel data throughput; θ_COPY: host-device copy throughput
– τ_KS: kernel spawn latency; τ_TOT: total latency overhead per iteration
– Λ_TOTAL: cumulative latency over the whole benchmark run, Λ_TOTAL = n_KER · τ_TOT
– t_GPU: CUDA-code execution time; G_GPU: CUDA-code speed-up; G_TOT: total speed-up
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
PART 1
The kernel can be modeled as a channel of throughput θ_KER(data_KER), since the actual throughput varies with data_KER: as data_KER increases, θ_KER saturates.
Most existing CUDA applications (including all the considered Rodinia benchmarks) exhibit the following computation pattern:

For n_KER iterations do:
1. Copy the input data from the host to the device
2. Launch the kernel on the data
3. Copy the results from the device to the host

For such applications, t_GPU is described by:

t_GPU = n_KER · (τ_TOT + data_KER / θ_KER)

where θ_KER is the kernel data throughput and τ_TOT is the total latency per iteration resulting from both the memory transfers and the kernel spawn.
For CUDA applications that do not use multiple concurrent CUDA streams, the total latency per computation iteration τ_TOT comprises the time spent transferring the data to and from the device plus the kernel spawn latency:

τ_TOT = 2 · data_KER / θ_COPY + τ_KS

This expression holds for all the considered Rodinia benchmarks.
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
PART 2
On a fused system there are no memory transfers, so the per-iteration latency reduces to τ_TOT = τ_KS, and the time t̃_GPU of executing the CUDA code on the fused system is given by

t̃_GPU = n_KER · (τ_KS + data_KER / θ_KER) ≈ n_KER · data_KER / θ_KER

(neglecting the residual spawn latency τ_KS). The speed-up of the CUDA code is then

G_GPU = t_GPU / t̃_GPU ≈ 1 + τ_TOT · θ_KER / data_KER

Since data_KER = data_TOTAL / n_KER, we obtain

G_GPU = 1 + (Λ_TOTAL / data_TOTAL) · θ_KER

Here Λ_TOTAL = n_KER · τ_TOT is the total latency accumulated during the benchmark execution, comprising all the kernel spawn and memory transfer latencies.
Note also that the throughput θ_KER of a benchmark kernel increases with data_KER for small data_KER values and saturates to a constant for large data_KER values. The throughput saturates when the input data size is sufficient for the maximum possible warp-scheduler occupancy for the given benchmark kernel. For benchmarks utilizing CUDA streams and overlapping kernel execution with data transfers, the latency is bounded from above:

τ_TOT ≤ 2 · data_KER / θ_COPY + τ_KS

This results in a smaller speed-up G_GPU for such benchmarks. Applying Amdahl's law, we get an expression for the total benchmark speed-up G_TOT:

G_TOT = 1 / (%CPU + %GPU / G_GPU)

where %CPU and %GPU are the fractions of the discrete-system execution time spent in CPU code and in CUDA code, respectively.
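The Amdahl's-law step can be checked numerically (illustrative Python; the 30%/70% time split and the G_GPU value are invented for the example):

```python
def total_speedup(cpu_fraction, gpu_fraction, g_gpu):
    """G_TOT = 1 / (%CPU + %GPU / G_GPU), fractions of discrete-system time."""
    assert abs(cpu_fraction + gpu_fraction - 1.0) < 1e-9
    return 1.0 / (cpu_fraction + gpu_fraction / g_gpu)

g = total_speedup(cpu_fraction=0.3, gpu_fraction=0.7, g_gpu=9.72)
# Even an enormous CUDA-code speed-up is capped by the CPU share of time:
cap = total_speedup(cpu_fraction=0.3, gpu_fraction=0.7, g_gpu=1e12)
```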
Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN
IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.