FUSIONSIM: A Cycle-Accurate CPU + GPU System Simulator
Vitaly Zakharenko, Andreas Moshovos
University of Toronto
Tor Aamodt
University of British Columbia
With support from AMD Canada, Ontario Centres of Excellence, and the Natural Sciences and Engineering Research Council of Canada.
3 | FusionSim: A Cycle Accurate CPU + GPU Simulator | June 13th, 2012
WHAT IS FUSIONSIM?
Detailed timing simulator of a complete system with an x86 CPU and a GPU
– Fused or Discrete Systems
FusionSim’s features:
– x86 out-of-order CPU + CUDA-capable GPU
Operate concurrently
– Detailed timing models for all components
Models reflect modern hardware
Enables performance modeling:
Fused vs. Discrete
“What if” scenarios
AGENDA
TWO FLAVOURS OF FUSIONSIM
Structure & Functionality of Discrete FusionSim
– Models a discrete system:
Distinct CPU and GPU chips
Separate CPU and GPU DRAM
Structure & Functionality of Fused FusionSim
– Models a fused system:
Same CPU and GPU chip
Shared CPU and GPU DRAM
Partly shared memory hierarchy
AGENDA
FUSION: WHICH BENCHMARK BENEFITS?
Analytical speed-up model
– Greater speed-up for:
Small benchmark input data size
Many kernel invocations (large cumulative latency overhead)
High benchmark kernel throughput
Long time spent in the GPU code relative to the x86 code
Simulation speed-up results on Rodinia
– Range: 1.05× to 9.72×
– A closer look at a fusion-friendly benchmark
Large speed-up (up to 9.72×) for small problem sizes
Smaller speed-up (1.8×) for medium problem sizes
– Dependence on latency overhead and kernel throughput
G_GPU = 1 + (Λ_TOTAL / data_TOTAL) · θ_KERNEL
      = 1 + 2·θ_KERNEL/θ_COPY + n_KERNEL·τ_KS·θ_KERNEL/data_TOTAL
AGENDA
FUSION: WHICH SYSTEM FACTORS AFFECT SPEED-UP?
Kernel spawn latency
– From the GPU API kernel-launch request until actual kernel execution begins
– Simulation: an order-of-magnitude reduction is important
CPU/GPU memory coherence
– Simulation: the performance loss is minor
Less than 2% for most Rodinia benchmarks
DISCRETE FUSIONSIM:
STRUCTURE
CPU from PTLSim: www.ptlsim.org
GPU from GPGPU-Sim: www.gpgpu-sim.org
CPU caches of MARSSx86: www.marss86.org
DISCRETE FUSIONSIM:
COMPONENT FEATURES
CPU: PTLSIM
– Fast x86 simulation speed: ~200 KIPS (isolated)
– Out-of-Order
– Micro-op architecture
– Cycle-accurate
– Modular & detailed memory
hierarchy model
GPU: GPGPU-SIM
– OpenCL/CUDA capable
Currently only CUDA
– High correlation vs. Nvidia GT200 and Fermi
NoC
– Detailed & configurable
DRAM
– Detailed
DISCRETE FUSIONSIM:
START-UP AND MEMORY LAYOUT
Input: standard Linux CUDA benchmark executable
The benchmark process is created
Simulator is injected into virtual memory space
– Private stack
– Private heap & heap management
– Invisible to the benchmark process
Simulator executes benchmark’s code:
– x86 code on PTLsim
– PTX code on GPGPU-Sim
The benchmark process communicates with FusionSim via a single page accessible by both, exposed through a replacement of the standard CUDA dynamic library
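The single-shared-page handshake can be sketched as a tiny mailbox (illustrative Python; the field layout is hypothetical, the real page format is FusionSim-internal):

```python
import struct

PAGE_SIZE = 4096
page = bytearray(PAGE_SIZE)  # the one page mapped into both address views

def benchmark_write_call(call_id, arg):
    """Benchmark side: publish an API request in the shared page."""
    struct.pack_into("<II", page, 0, call_id, arg)

def simulator_read_call():
    """Simulator side: pick up the pending request on its next cycle."""
    return struct.unpack_from("<II", page, 0)

benchmark_write_call(call_id=7, arg=1024)
pending = simulator_read_call()
```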
DISCRETE FUSIONSIM:
MAIN SIMULATION LOOP
Single simulation loop:
– Each loop cycle == tick of a virtual ‘common’ clock
– common clock frequency × GPU_MULTIPLIER = GPU_FREQ
– common clock frequency × CPU_MULTIPLIER = CPU_FREQ
WHILE (1) {
FOR GPU_MULTIPLIER ITERATIONS DO {
GPU_CYCLE()
}
FOR CPU_MULTIPLIER ITERATIONS DO {
CPU_CYCLE()
}
}
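The loop above can be written as runnable Python (a minimal illustration of the clock-ratio scheme, not FusionSim's actual implementation; the component callbacks are placeholders):

```python
def run(gpu_cycle, cpu_cycle, gpu_multiplier, cpu_multiplier, common_ticks):
    """Drive GPU and CPU models from one virtual 'common' clock.

    Each common tick advances the GPU gpu_multiplier cycles and the CPU
    cpu_multiplier cycles, keeping the component frequencies in the
    ratio gpu_multiplier : cpu_multiplier.
    """
    for _ in range(common_ticks):
        for _ in range(gpu_multiplier):
            gpu_cycle()
        for _ in range(cpu_multiplier):
            cpu_cycle()

# Example: a CPU clocked 3x faster than the GPU, common clock = GPU clock
counts = {"gpu": 0, "cpu": 0}
run(lambda: counts.__setitem__("gpu", counts["gpu"] + 1),
    lambda: counts.__setitem__("cpu", counts["cpu"] + 1),
    gpu_multiplier=1, cpu_multiplier=3, common_ticks=1000)
```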
DISCRETE FUSIONSIM:
EXAMPLE GPU API CALL
Virtual PTLsim CPU executes x86 code
– a call to the API function cudaMemcpyAsync(a, b, c) is reached
On next GPU cycle, FusionSim
– Identifies pending API call
– Enqueues the task for the GPU
– Decides whether to block the CPU (synchronous) or
to let the CPU proceed (asynchronous)
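This enqueue-and-decide step can be sketched as follows (illustrative Python; the call classification below is a simplified stand-in, not the exact CUDA blocking semantics):

```python
from collections import deque

SYNCHRONOUS = {"cudaMemcpy", "cudaDeviceSynchronize"}   # CPU must block
ASYNCHRONOUS = {"cudaMemcpyAsync", "kernel_launch"}     # CPU proceeds

class GpuTaskQueue:
    """Pending GPU work, drained one task per simulated GPU cycle."""
    def __init__(self):
        self.pending = deque()

    def dispatch(self, api_call):
        """Enqueue an API call for the GPU; return True if the CPU blocks."""
        self.pending.append(api_call)
        return api_call in SYNCHRONOUS

q = GpuTaskQueue()
cpu_blocked = q.dispatch("cudaMemcpyAsync")  # asynchronous: CPU keeps running
```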
DISCRETE FUSIONSIM:
SIMULATOR FEATURES
Correctly models ordering and overlap in time of
– asynchronous & synchronous operations
– memory transfers
– CUDA events
– Kernel computations
– CPU processing
Models duration of all CUDA stream operations
Simple and powerful mechanism for management of configuration and simulation output files
FUSED FUSIONSIM:
STRUCTURE
Processing Cluster is replaced by a CPU
CUDA ‘global’ memory address space is shared
– No more memory transfers to/from device DRAM
Last Level Cache size is adjusted (increased)
– GPU’s L2 is also CPU’s L3
CPU: L1 and L2 private caches
FUSED FUSIONSIM:
A CHALLENGE WITH EXISTING CPU + GPU MEMORY SPACES
CUDA ‘global’ memory space
– Shared between CPU & GPU
– Accessible by both using the same virtual address
– Cached in LLC and mapped to DRAM
CUDA ‘local’ memory space
– Private to the GPU
– Inaccessible by the CPU
– Cached in LLC and mapped to DRAM
How do we model these?
FUSED FUSIONSIM:
SIMULATING THE CPU AND GPU MEMORY SPACES
Common Virtual memory
– Used by both the CPU and the GPU
– Slightly different virtual memory spaces
‘Generic’ virtual address
– Used by the GPU
– For the same location X accessible by the CPU at virtual address virt_addr:
generic_virt_addr = virt_addr + 0x40000000
32-bit virtual address space (4 GBytes)
– FusionSim does not simulate OS kernel code, so the top-most 1 GByte of addresses is unused
FUSED FUSIONSIM:
MEMORY SPACE: WHERE AND WHAT
CPU
– Uses CPU virtual address
GPU
– Uses ‘generic’ virtual address
Caches
– Physically-addressed
– The CPU adjusts its virtual address to ‘generic’ and translates it to physical
– The GPU translates ‘generic’ directly to physical
MMU
– Same MMU for both the CPU and the GPU
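This address handling can be sketched in a few lines (the 0x40000000 offset is from the slides; the dictionary page table is a stand-in for the shared MMU):

```python
GENERIC_OFFSET = 0x40000000
PAGE_SIZE = 4096

def cpu_to_generic(cpu_virt_addr):
    """Adjust a CPU virtual address into the GPU's 'generic' space."""
    return cpu_virt_addr + GENERIC_OFFSET

def generic_to_physical(generic_addr, page_table):
    """Translate a generic virtual address to physical via the shared MMU."""
    vpn, offset = divmod(generic_addr, PAGE_SIZE)
    return page_table[vpn] * PAGE_SIZE + offset

# The CPU (after adjustment) and the GPU reach the same physical byte:
page_table = {(0x40001000 + GENERIC_OFFSET) // PAGE_SIZE: 7}
phys = generic_to_physical(cpu_to_generic(0x40001000), page_table)
```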
FUSED FUSIONSIM
MEMORY COHERENCE
Shared CUDA ‘global’ address space
Same block from ‘global’ space
– Cached in private CPU L1 $
– Cached in private GPU L1 $
Potential coherence problem
First-cut solution: Flushing caches to LLC
– Interesting area for exploration
FUSED FUSIONSIM
MEMORY COHERENCE: IMPLEMENTATION
CPU side:
– Selective flushing of private caches
– cudaSelectivelyFlush(address, size) is called
prior to every kernel invocation
for every region of memory accessed by the kernel
GPU side:
– GPGPU-Sim already flushes the caches
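The CPU-side selective flush can be modeled minimally (illustrative Python; the 64-byte block size and the flat dirty-block map are assumptions, not FusionSim's cache model):

```python
BLOCK = 64  # cache block size in bytes (an assumption for illustration)

class PrivateCache:
    def __init__(self):
        self.dirty = {}  # block number -> data

    def cuda_selectively_flush(self, address, size, llc):
        """Write back and drop every dirty block overlapping [address, address+size)."""
        first = address // BLOCK
        last = (address + size - 1) // BLOCK
        for blk in range(first, last + 1):
            if blk in self.dirty:
                llc[blk] = self.dirty.pop(blk)  # write back to the shared LLC

# Flush only the region a kernel is about to read; unrelated blocks stay put
l1, llc = PrivateCache(), {}
l1.dirty = {100: b"a", 101: b"b", 300: b"c"}
l1.cuda_selectively_flush(address=100 * BLOCK, size=2 * BLOCK, llc=llc)
```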
FUSED FUSIONSIM
CHANGES TO GPU API
No need for device memory allocation API
– cudaMalloc()
– cudaFree()
No memory transfers to/from device DRAM
– cudaMemCpy()
– cudaMemset()
Additional API function:
– cudaSelectivelyFlush(address, size)
FUSED VS DISCRETE: EXPERIMENTAL METHODOLOGY
Rodinia
– benchmark suite for heterogeneous computing
Discrete system modeled by Discrete FusionSim
– Unmodified Rodinia
Fused system modeled by Fused FusionSim
– Modified Rodinia:
No cudaMalloc()/cudaFree()
No memory transfers
Added cudaSelectivelyFlush()
Data input generation is excluded from the time measurement
FUSED VS DISCRETE: RELATIVE PERFORMANCE
Rodinia benchmarks
Two baseline discrete systems:
– 10 µs kernel spawn latency
– 100 µs kernel spawn latency
Speed-up varies:
– from 1.05× (nn, 10 µs)
– up to 9.72× (gaus_4, 10 µs)
[Chart: per-benchmark speed-up of fused over discrete; FUSED is better]
FUSED SYSTEM: KERNEL SPAWN LATENCY
One baseline discrete system:
– 10 µs kernel spawn latency
Different fused systems:
– 0.1 µs kernel spawn latency
– 1 µs kernel spawn latency
– 10 µs kernel spawn latency
Simulations show:
– Reducing the latency to 1 µs is important
– Further reduction below 1 µs is NOT important
[Chart: per-benchmark speed-up for each fused spawn latency; FUSED is better]
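The diminishing returns follow from the per-iteration time τ_KS + data_KER/θ_KER: once the spawn latency is small next to the kernel time, shrinking it further barely moves the total. A worked example with made-up numbers (a 50 µs kernel is an assumption for illustration):

```python
KERNEL_TIME_US = 50.0  # hypothetical per-iteration kernel compute time

def iteration_time(spawn_latency_us):
    """Per-iteration time: spawn latency plus kernel execution time."""
    return spawn_latency_us + KERNEL_TIME_US

t10, t1, t01 = (iteration_time(l) for l in (10.0, 1.0, 0.1))
gain_10_to_1 = t10 / t1   # ~1.18: cutting 10 us -> 1 us is noticeable
gain_1_to_01 = t1 / t01   # ~1.02: cutting 1 us -> 0.1 us is negligible
```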
FUSED SYSTEM: COHERENCE OVERHEAD
Two fused systems:
– Incoherent vs. coherent
– Kernel spawn latency is 0.1 µs in both systems
Simulations show:
– Minor performance loss
Less than 2% for most benchmarks
5% for bfs_small
[Chart: coherence overhead per benchmark; SMALLER is better]
FUSION: WHICH BENCHMARK BENEFITS?
ANALYTICAL MODEL
Symbols (see the derivation slides):
– Λ_TOTAL: total cumulative latency
– θ_KERNEL: kernel throughput
– data_TOTAL: benchmark input data size
Greater speed-up for:
– Small benchmark input data size (small data_TOTAL)
– Many kernel invocations and memory transfers (large Λ_TOTAL)
– High benchmark kernel throughput (large θ_KERNEL)
– Long time spent in the GPU code relative to the CPU code
Analytical model:
G_GPU = 1 + (Λ_TOTAL / data_TOTAL) · θ_KERNEL
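The model G_GPU = 1 + (Λ_TOTAL/data_TOTAL)·θ_KERNEL can be evaluated directly (illustrative Python; the parameter values are made up, not Rodinia measurements):

```python
def gpu_speedup(total_latency_us, data_total_bytes, kernel_throughput_b_per_us):
    """G_GPU = 1 + (latency overhead / total data) * kernel throughput."""
    return 1.0 + (total_latency_us / data_total_bytes) * kernel_throughput_b_per_us

# Small input + many spawns (large normalized latency) => big win from fusion
g_small = gpu_speedup(total_latency_us=1000.0, data_total_bytes=1_000_000,
                      kernel_throughput_b_per_us=1000.0)   # 2.0
# Same latency amortized over 100x more data => little win
g_large = gpu_speedup(total_latency_us=1000.0, data_total_bytes=100_000_000,
                      kernel_throughput_b_per_us=1000.0)   # 1.01
```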
FUSION: WHICH BENCHMARK BENEFITS?
TWO SCENARIOS
Per-chunk form of the model, with the throughput written as a function of the chunk size:
G_GPU = 1 + 2·θ_KERNEL(data_CHUNK)/θ_COPY + τ_KS·θ_KERNEL(data_CHUNK)/data_CHUNK
Scenario 1: large data_CHUNK
– θ_KERNEL(data_CHUNK) is large (saturated)
– The memory-copy term is significant
– The kernel-spawn term is insignificant
Scenario 2: small data_CHUNK
– θ_KERNEL(data_CHUNK) is small
– The memory-copy term is insignificant
– The kernel-spawn term is significant
FUSION: WHICH BENCHMARK BENEFITS?
INPUT DATA SIZE
Greater problem size => smaller benefit from fusion
[Chart: fused-vs-discrete speed-up for Rodinia BFS and Rodinia Gaussian at increasing input sizes; FUSED is better, and the benefit shrinks as input size grows]
FUSION: WHICH BENCHMARK BENEFITS?
LATENCY OVERHEAD
Comparison between two benchmarks:
– Rodinia Gaussian
Speed-up 9.72x
– Rodinia NN
Speed-up 1.05x
– Why?
100 times more kernel spawns for Gaussian
10 times more memory copies for Gaussian
Gaussian therefore has a much larger normalized latency overhead Λ_TOTAL/data_TOTAL; by the model G_GPU = 1 + (Λ_TOTAL/data_TOTAL)·θ_KERNEL, this yields the much larger speed-up.
FUSION: WHICH BENCHMARK BENEFITS?
KERNEL THROUGHPUT
Comparison between two benchmarks:
– Rodinia BFS
Speed-up 4.28x
– Rodinia NN
Speed-up 1.05x
– Why?
100 times greater kernel throughput θ_KERNEL for BFS
By the model G_GPU = 1 + (Λ_TOTAL/data_TOTAL)·θ_KERNEL, the much larger θ_KERNEL amplifies the latency-overhead term, yielding the larger speed-up.
FUSIONSIM WEBSITE:
DOCUMENTATION & SOURCE CODE
www.fusionsim.ca
– Discrete FusionSim & Fused FusionSim
– Source code
– Documentation
– Google group for collaborators
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
SYMBOL MEANINGS
– n_KER: number of kernel invocations (computation iterations)
– data_KER: input data size per kernel invocation; data_TOTAL = n_KER · data_KER
– θ_KER (θ_KERNEL): kernel data throughput; θ_COPY: host-device copy throughput
– τ_KS: kernel spawn latency; τ_TOT: total latency overhead per iteration
– Λ_TOTAL: cumulative latency over the whole benchmark run, Λ_TOTAL = n_KER · τ_TOT
– t_GPU: CUDA-code execution time; G_GPU: CUDA-code speed-up; G_TOT: total speed-up
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
PART 1
The kernel can be modeled as a channel of throughput θ_KER(data_KER), since the actual throughput varies with data_KER: as data_KER increases, θ_KER saturates.
Most existing CUDA applications (including all the considered Rodinia benchmarks) exhibit the following computation pattern:

For n_KER iterations do:
1. Copy the input data from the host to the device
2. Launch the kernel on the data
3. Copy the results from the device to the host

For such applications, t_GPU is described by:

t_GPU = n_KER · (τ_TOT + data_KER / θ_KER)

where θ_KER is the kernel data throughput and τ_TOT is the total latency per iteration resulting from both the memory transfers and the kernel spawn.
For CUDA applications that do not use multiple concurrent CUDA streams, the total latency per computation iteration τ_TOT comprises the time spent transferring the data to and from the device plus the kernel spawn latency:

τ_TOT = 2 · data_KER / θ_COPY + τ_KS

This expression holds for all the considered Rodinia benchmarks.
DERIVATION OF THE ANALYTICAL SPEED-UP MODEL
PART 2
On a fused system there are no memory transfers, so the per-iteration latency reduces to τ_TOT = τ_KS, and the time t̃_GPU of executing the CUDA code on the fused system is given by

t̃_GPU = n_KER · (τ_KS + data_KER / θ_KER) ≈ n_KER · data_KER / θ_KER

(neglecting the residual spawn latency τ_KS). The speed-up of the CUDA code is then

G_GPU = t_GPU / t̃_GPU ≈ 1 + τ_TOT · θ_KER / data_KER

Since data_KER = data_TOTAL / n_KER, we obtain

G_GPU = 1 + (Λ_TOTAL / data_TOTAL) · θ_KER

Here Λ_TOTAL = n_KER · τ_TOT is the total latency accumulated during the benchmark execution, comprising all the kernel spawn and memory transfer latencies.
Note also that the throughput θ_KER of a benchmark kernel increases with data_KER for small data_KER values and saturates to a constant for large data_KER values. The throughput saturates when the input data size is sufficient for the maximum possible warp-scheduler occupancy for the given benchmark kernel. For benchmarks utilizing CUDA streams and overlapping kernel execution with data transfers, the latency is bounded from above:

τ_TOT ≤ 2 · data_KER / θ_COPY + τ_KS

This results in a smaller speed-up G_GPU for such benchmarks. Applying Amdahl's law, we get an expression for the total benchmark speed-up G_TOT:

G_TOT = 1 / (%CPU + %GPU / G_GPU)

where %CPU and %GPU are the fractions of the discrete-system execution time spent in CPU code and in CUDA code, respectively.
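The Amdahl's-law step can be checked numerically (illustrative Python; the 30%/70% time split and the G_GPU value are invented for the example):

```python
def total_speedup(cpu_fraction, gpu_fraction, g_gpu):
    """G_TOT = 1 / (%CPU + %GPU / G_GPU), fractions of discrete-system time."""
    assert abs(cpu_fraction + gpu_fraction - 1.0) < 1e-9
    return 1.0 / (cpu_fraction + gpu_fraction / g_gpu)

g = total_speedup(cpu_fraction=0.3, gpu_fraction=0.7, g_gpu=9.72)
# Even an enormous CUDA-code speed-up is capped by the CPU share of time:
cap = total_speedup(cpu_fraction=0.3, gpu_fraction=0.7, g_gpu=1e12)
```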
Disclaimer & Attribution The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions
and typographical errors.
The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited
to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product
differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no
obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to
make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.
NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO
RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS
INFORMATION.
ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY
DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL
OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN
IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.
AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in
this presentation are for informational purposes only and may be trademarks of their respective owners.
The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and
opinions presented in this presentation may not represent AMD’s positions, strategies or opinions. Unless explicitly stated, AMD is
not responsible for the content herein and no endorsements are implied.