Page 1: FusionSim Simulator
Page 2: FusionSim Simulator

FUSIONSIM: A Cycle-Accurate CPU + GPU System Simulator

Vitaly Zakharenko, Andreas Moshovos

University of Toronto

Tor Aamodt

University of British Columbia

With support from AMD Canada, Ontario Centres of Excellence, and
the Natural Sciences and Engineering Research Council of Canada.

Page 3: FusionSim Simulator


FUSIONSIM: A CYCLE-ACCURATE CPU + GPU SYSTEM SIMULATOR

Page 4: FusionSim Simulator


WHAT IS FUSIONSIM?

Detailed timing simulator of a complete system with an x86 CPU and a GPU

– Fused or Discrete Systems

FusionSim’s features:

– x86 out-of-order CPU + CUDA-capable GPU

Operate concurrently

– Detailed timing models for all components

Models reflect modern hardware

Enables performance modeling:

Fused vs. Discrete

“What if” scenarios

Page 5: FusionSim Simulator


AGENDA

TWO FLAVOURS OF FUSIONSIM

Structure & Functionality of Discrete FusionSim

– Models a discrete system:

Distinct CPU and GPU chips

Separate CPU and GPU DRAM

Structure & Functionality of Fused FusionSim

– Models a fused system:

CPU and GPU on the same chip

Shared CPU and GPU DRAM

Partly shared memory hierarchy

Page 6: FusionSim Simulator


AGENDA

FUSION: WHICH BENCHMARK BENEFITS?

Analytical speed-up model

– Greater speed up for

Small benchmark input data size

Many kernel invocations (large cumulative latency overhead)

High benchmark kernel throughput

Long time spent in the GPU code relative to the x86 code

Simulated speed-up results for the Rodinia benchmarks

– Range: 1.05x to 9.72x

– A closer look at a fusion-friendly benchmark

Large speed-up (up to 9.72x) for small problem sizes

Smaller speed-up (1.8x) for medium problem sizes

– Dependence on latency overhead and kernel throughput

$$G_{GPU} = 1 + \frac{\lambda_{TOTAL}}{data_{TOTAL}}\,\theta_{KERNEL}$$

Page 7: FusionSim Simulator


AGENDA

FUSION: WHICH SYSTEM FACTORS AFFECT SPEED-UP?

Kernel spawn latency

– From GPU API kernel launch request until actual kernel execution

– Simulation: order-of-magnitude reduction is important

CPU/GPU memory coherence

– Simulation: performance loss is minor

Less than 2% for most Rodinia benchmarks

Page 8: FusionSim Simulator


DISCRETE FUSIONSIM:

STRUCTURE

CPU from PTLSim: www.ptlsim.org

GPU from GPGPU-Sim: www.gpgpu-sim.org

CPU caches from MARSSx86: www.marss86.org

Page 9: FusionSim Simulator


DISCRETE FUSIONSIM:

COMPONENT FEATURES

CPU: PTLSIM

– Fast x86 simulation: 200 KIPS (isolated)

– Out-of-Order

– Micro-op architecture

– Cycle-accurate

– Modular & detailed memory hierarchy model

GPU: GPGPU-SIM

– OpenCL/CUDA capable

Currently only CUDA

– High correlation with Nvidia GT200 and Fermi

NoC

– Detailed & configurable

DRAM

– Detailed

Page 10: FusionSim Simulator


DISCRETE FUSIONSIM:

START-UP AND MEMORY LAYOUT

Input: standard Linux CUDA benchmark executable

The benchmark process is created

The simulator is injected into the benchmark’s virtual memory space

– Private stack

– Private heap & heap management

– Invisible to the benchmark process

Simulator executes benchmark’s code:

– x86 code on PTLsim

– PTX code on GPGPU-Sim

The benchmark process communicates with FusionSim via a single page accessible by both

– Through a replacement of the standard dynamic library
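To make the library-replacement path concrete, here is a minimal sketch of what a replacement CUDA runtime stub could look like. The shared-page layout, the fixed mapping address, the API identifier, and the stand-in CUDA typedefs are assumptions for illustration, not FusionSim’s actual interface.

#include <cstddef>
#include <cstdint>

// Stand-ins for the CUDA runtime types; a real replacement library would take
// these from the CUDA headers instead.
typedef int cudaError_t;
typedef int cudaMemcpyKind;
typedef void* cudaStream_t;

// Hypothetical layout of the single page shared by the benchmark process and
// the injected simulator (not FusionSim's actual format).
struct SharedPage {
    volatile uint32_t pending;  // set by the stub, cleared by the simulator
    uint32_t api_id;            // which API call is being requested
    uint64_t args[8];           // raw argument values of the call
    int32_t  ret;               // return code filled in by the simulator
};

// Illustrative fixed mapping address agreed upon with the simulator.
static SharedPage* const g_page = reinterpret_cast<SharedPage*>(0x7f0000000000ULL);

// Replacement stub: instead of driving a real GPU, it records the request in
// the shared page; the simulator notices it on a subsequent GPU cycle.
extern "C" cudaError_t cudaMemcpyAsync(void* dst, const void* src, size_t count,
                                       cudaMemcpyKind kind, cudaStream_t stream) {
    g_page->api_id  = 3;  // hypothetical identifier for cudaMemcpyAsync
    g_page->args[0] = reinterpret_cast<uint64_t>(dst);
    g_page->args[1] = reinterpret_cast<uint64_t>(src);
    g_page->args[2] = count;
    g_page->args[3] = static_cast<uint64_t>(kind);
    g_page->args[4] = reinterpret_cast<uint64_t>(stream);
    g_page->pending = 1;  // hand the request over; asynchronous, so return at once
    return 0;             // cudaSuccess
}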

Page 11: FusionSim Simulator


DISCRETE FUSIONSIM:

MAIN SIMULATION LOOP

Single simulation loop:

– Each loop cycle == one tick of a virtual ‘common’ clock

– Common clock frequency x GPU_MULTIPLIER = GPU_FREQ

– Common clock frequency x CPU_MULTIPLIER = CPU_FREQ

while (true) {
    for (int i = 0; i < GPU_MULTIPLIER; i++) {
        GPU_CYCLE();
    }
    for (int i = 0; i < CPU_MULTIPLIER; i++) {
        CPU_CYCLE();
    }
}
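The slide does not say how GPU_MULTIPLIER and CPU_MULTIPLIER are chosen; one consistent choice, sketched below, is to run the common clock at the greatest common divisor of the two frequencies so that both multipliers are whole numbers. The frequency values and the run() driver are purely illustrative.

#include <cstdint>
#include <numeric>   // std::gcd

// Illustrative only: frequencies in MHz for a hypothetical configuration.
constexpr uint64_t CPU_FREQ = 3000;   // 3.0 GHz CPU core clock
constexpr uint64_t GPU_FREQ = 1500;   // 1.5 GHz GPU shader clock

// Common clock = gcd of the two frequencies, so both multipliers are integers.
constexpr uint64_t COMMON_FREQ    = std::gcd(CPU_FREQ, GPU_FREQ);  // 1500 MHz
constexpr uint64_t CPU_MULTIPLIER = CPU_FREQ / COMMON_FREQ;        // 2 CPU cycles per tick
constexpr uint64_t GPU_MULTIPLIER = GPU_FREQ / COMMON_FREQ;        // 1 GPU cycle per tick

void GPU_CYCLE() { /* advance the GPU timing model by one GPU cycle */ }
void CPU_CYCLE() { /* advance the CPU timing model by one CPU cycle */ }

void run(uint64_t common_ticks) {
    for (uint64_t t = 0; t < common_ticks; ++t) {   // one tick of the common clock
        for (uint64_t i = 0; i < GPU_MULTIPLIER; ++i) GPU_CYCLE();
        for (uint64_t i = 0; i < CPU_MULTIPLIER; ++i) CPU_CYCLE();
    }
}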

Page 12: FusionSim Simulator


DISCRETE FUSIONSIM:

EXAMPLE GPU API CALL

Virtual PTLsim CPU executes x86 code

– A call to the API function cudaMemcpyAsync(a, b, c) is reached

On next GPU cycle, FusionSim

– Identifies pending API call

– Enqueues the task for the GPU

– Decides whether to block the CPU (synchronous) or to let the CPU proceed (asynchronous)
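A minimal sketch of this dispatch step from the simulator’s side. The SharedPage layout, the task queue, and the API identifier are hypothetical; the sketch only illustrates the flow named above (identify pending call, enqueue it for the GPU, block or release the CPU).

#include <cstdint>
#include <queue>

struct SharedPage { volatile uint32_t pending; uint32_t api_id; uint64_t args[8]; int32_t ret; };
struct GpuTask    { uint32_t api_id; uint64_t args[8]; };

static std::queue<GpuTask> g_gpu_queue;   // work items for the GPU model
static bool g_cpu_blocked = false;        // true while the CPU waits on a synchronous call

void on_gpu_cycle(SharedPage& page) {
    if (page.pending) {                           // a GPU API call is waiting
        GpuTask task{page.api_id, {}};
        for (int i = 0; i < 8; ++i) task.args[i] = page.args[i];
        g_gpu_queue.push(task);                   // enqueue the work for the GPU

        // Synchronous calls keep the CPU stalled until the task completes;
        // asynchronous calls (e.g. cudaMemcpyAsync) let the CPU continue now.
        bool is_async = (page.api_id == 3);       // hypothetical async identifier
        g_cpu_blocked = !is_async;
        if (is_async) page.pending = 0;           // release the stub right away
    }
    // ... advance the GPU timing model by one cycle here ...
}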

Page 13: FusionSim Simulator


DISCRETE FUSIONSIM:

SIMULATOR FEATURES

Correctly models ordering and overlap in time of

– Asynchronous & synchronous operations

– Memory transfers

– CUDA events

– Kernel computations

– CPU processing

Models duration of all CUDA stream operations

Simple and powerful mechanism for managing configuration and simulation output files

Page 14: FusionSim Simulator


FUSED FUSIONSIM:

STRUCTURE

A GPU processing cluster is replaced by the CPU

The CUDA ‘global’ memory address space is shared

– No more memory transfers to/from device DRAM

Last Level Cache size is adjusted (increased)

– GPU’s L2 is also CPU’s L3

CPU: L1 and L2 private caches

Page 15: FusionSim Simulator


FUSED FUSIONSIM:

A CHALLENGE WITH EXISTING CPU + GPU MEMORY SPACES

CUDA ‘global’ memory space

– Shared between CPU & GPU

– Accessible by both using the same virtual address

– Cached in LLC and mapped to DRAM

CUDA ‘local’ memory space

– Private to the GPU

– Inaccessible by the CPU

– Cached in LLC and mapped to DRAM

How do we model these?

Page 16: FusionSim Simulator


FUSED FUSIONSIM:

SIMULATING THE CPU AND GPU MEMORY SPACES

Common Virtual memory

– Used by both the CPU and the GPU

– Slightly different virtual memory spaces

‘Generic’ virtual address

– Used by the GPU

– For the same location X accessible by the CPU at virt_addr:

Generic_virt_addr = virt_addr + 0x40000000

32-bit virtual address space (4 GBytes)

– FusionSim does not simulate OS kernel code, so the top-most 1 GByte of addresses is unused

Page 17: FusionSim Simulator


FUSED FUSIONSIM:

MEMORY SPACE: WHERE AND WHAT

CPU

– Uses CPU virtual address

GPU

– Uses ‘generic’ virtual address

Caches

– Physically-addressed

– CPU adjusts the virtual address to ‘generic’ and translates it to physical

– GPU directly translates ‘generic’ to physical

MMU

– Same MMU for both the CPU and the GPU

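A small sketch of the two lookup paths into the physically-addressed caches, using the address adjustment given on the previous page. The function names and the identity-mapped MMU stub are illustrative, not FusionSim’s code.

#include <cstdint>

using VirtAddr = uint64_t;
using PhysAddr = uint64_t;

// Offset between a CPU virtual address and the corresponding 'generic'
// virtual address (from the previous page).
constexpr VirtAddr GENERIC_OFFSET = 0x40000000;

// Stand-in for the shared page-table walk; identity-mapped for illustration.
PhysAddr mmu_translate(VirtAddr generic_vaddr) {
    return generic_vaddr;
}

// CPU access path: adjust to the 'generic' space first, then translate.
PhysAddr cpu_to_physical(VirtAddr cpu_vaddr) {
    return mmu_translate(cpu_vaddr + GENERIC_OFFSET);
}

// GPU access path: GPU addresses are already 'generic', translate directly.
PhysAddr gpu_to_physical(VirtAddr generic_vaddr) {
    return mmu_translate(generic_vaddr);
}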

Page 18: FusionSim Simulator


FUSED FUSIONSIM

MEMORY COHERENCE

Shared CUDA ‘global’ address space

Same block from ‘global’ space

– Cached in private CPU L1 $

– Cached in private GPU L1 $

Potential coherence problem

First-cut solution: Flushing caches to LLC

– Interesting area for exploration

Page 19: FusionSim Simulator


FUSED FUSIONSIM

MEMORY COHERENCE: IMPLEMENTATION

CPU side:

– Selective flushing of private caches

– cudaSelectivelyFlush(address, size)

Prior to every kernel invocation

For every region of memory accessed by the kernel

GPU side:

– GPGPU-Sim already flushes the caches

Page 20: FusionSim Simulator


FUSED FUSIONSIM

CHANGES TO GPU API

No need for device memory allocation API

– cudaMalloc()

– cudaFree()

No memory transfers to/from device DRAM

– cudaMemcpy()

– cudaMemset()

Additional API function:

– cudaSelectivelyFlush(address, size)
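A hedged sketch of what these API changes amount to when porting a benchmark from the discrete to the fused model. The kernel, sizes, and flushed region are made up, and cudaSelectivelyFlush is declared here with assumed parameter types only so the sketch is self-contained.

#include <cuda_runtime.h>
#include <cstddef>

// FusionSim-specific call named on this page; parameter types assumed.
extern "C" void cudaSelectivelyFlush(void* address, size_t size);

__global__ void scale(float* data, int n) {          // illustrative kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Discrete FusionSim: unmodified CUDA host code.
void run_discrete(float* host, int n) {
    float* dev = nullptr;
    cudaMalloc(&dev, n * sizeof(float));                               // device allocation
    cudaMemcpy(dev, host, n * sizeof(float), cudaMemcpyHostToDevice);  // copy in
    scale<<<(n + 255) / 256, 256>>>(dev, n);
    cudaMemcpy(host, dev, n * sizeof(float), cudaMemcpyDeviceToHost);  // copy out
    cudaFree(dev);
}

// Fused FusionSim: the CUDA 'global' space is shared, so allocation and copies
// disappear; the only addition is a selective flush of the regions the kernel
// will access.
void run_fused(float* host, int n) {
    cudaSelectivelyFlush(host, n * sizeof(float));
    scale<<<(n + 255) / 256, 256>>>(host, n);   // kernel reads the CPU's memory directly
}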

Page 21: FusionSim Simulator


FUSED VS DISCRETE: EXPERIMENTAL METHODOLOGY

Rodinia

– Benchmark suite for heterogeneous computing

Discrete system modeled by Discrete FusionSim

– Unmodified Rodinia

Fused system modeled by Fused FusionSim

– Modified Rodinia:

No cudaMalloc()/cudaFree()

No memory transfers

Added cudaSelectivelyFlush()

Data input generation is excluded from the time measurement

Page 22: FusionSim Simulator


FUSED VS DISCRETE: RELATIVE PERFORMANCE

Rodinia benchmarks

Two baseline discrete systems:

– 10 µsec kernel spawn latency

– 100 µsec kernel spawn latency

Speed-up varies:

– From 1.05x (nn, 10 µsec)

– Up to 9.72x (gaus_4, 10 µsec)

FUSED is better

Page 23: FusionSim Simulator


FUSED SYSTEM: KERNEL SPAWN LATENCY

One baseline discrete system:

– 10 µsec kernel spawn latency

Different fused systems:

– 0.1 µsec kernel spawn latency

– 1 µsec kernel spawn latency

– 10 µsec kernel spawn latency

Simulations show:

– Reduction of the latency to 1 µsec is important

– Further reduction below 1 µsec is NOT important

FUSED is better

Page 24: FusionSim Simulator


FUSED SYSTEM: COHERENCE OVERHEAD

Two fused systems:

– Incoherent vs. coherent

– Kernel spawn latency is 0.1 µsec in both systems

Simulations show:

– Minor performance loss

Less than 2% for most benchmarks

5% for bfs_small

SMALLER is better

Page 25: FusionSim Simulator


FUSION: WHICH BENCHMARK BENEFITS?

ANALYTICAL MODEL

Semantics

– $\lambda_{TOTAL}$: total cumulative latency

– $\theta_{KERNEL}$: kernel throughput

– $data_{TOTAL}$: benchmark data input size

Greater speed-up for

– Small benchmark input data size (small $data_{TOTAL}$)

– Many kernel invocations and memory transfers (large $\lambda_{TOTAL}$)

– High benchmark kernel throughput (large $\theta_{KERNEL}$)

– Long time spent in the GPU code relative to the CPU code

$$G_{GPU} = 1 + \frac{\lambda_{TOTAL}}{data_{TOTAL}}\,\theta_{KERNEL}$$
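Purely as an illustration of the arithmetic (the numbers are made up, not simulation results): suppose a benchmark moves $data_{TOTAL} = 100$ MB of input in total, accumulates $\lambda_{TOTAL} = 5$ ms of kernel-spawn and copy latency, and its kernel sustains $\theta_{KERNEL} = 10$ GB/s. Then

$$G_{GPU} = 1 + \frac{\lambda_{TOTAL}}{data_{TOTAL}}\,\theta_{KERNEL} = 1 + \frac{5\ \text{ms} \times 10\ \text{GB/s}}{100\ \text{MB}} = 1.5$$

and, if 80% of the benchmark time is spent in GPU code, the Amdahl's-law expression derived later gives

$$G_{TOT} = \frac{1}{0.2 + 0.8/1.5} \approx 1.36$$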

Page 26: FusionSim Simulator


FUSION: WHICH BENCHMARK BENEFITS?

TWO SCENARIOS

Per kernel invocation, with chunk size $data_{CHUNK}$:

$$G_{GPU} = 1 + \frac{\lambda(data_{CHUNK})\,\theta_{KERNEL}(data_{CHUNK})}{data_{CHUNK}}, \qquad \lambda(data_{CHUNK}) = \lambda_{KS} + \frac{2\,data_{CHUNK}}{\beta_{COPY}}$$

Large $data_{CHUNK}$: the copy term $\dfrac{2\,data_{CHUNK}}{\beta_{COPY}}$ is significant, the kernel spawn latency $\lambda_{KS}$ is insignificant:

$$G_{GPU} \approx 1 + \frac{2\,\theta_{KERNEL}(data_{CHUNK})}{\beta_{COPY}}$$

Small $data_{CHUNK}$: the copy term is insignificant, the kernel spawn latency $\lambda_{KS}$ is significant:

$$G_{GPU} \approx 1 + \frac{\lambda_{KS}\,\theta_{KERNEL}(data_{CHUNK})}{data_{CHUNK}}$$

Page 27: FusionSim Simulator


FUSION: WHICH BENCHMARK BENEFITS?

INPUT DATA SIZE

Greater problem size → smaller benefit from fusion

FUSED is better

INPUT SIZE IS GREATER →

Rodinia BFS, Rodinia Gaussian

Page 28: FusionSim Simulator


FUSION: WHICH BENCHMARK BENEFITS?

LATENCY OVERHEAD

Comparison between two benchmarks:

– Rodinia Gaussian

Speed-up 9.72x

– Rodinia NN

Speed-up 1.05x

– Why?

100 times more kernel spawns for Gaussian

10 times more memory copies for Gaussian

Normalized latency overhead: $\dfrac{\lambda_{TOTAL}}{data_{TOTAL}}$, which grows with the number of kernel spawns $n_{KERNEL}$ and memory copies $n_{MEM\_COPY}$

$$G_{GPU} = 1 + \frac{\lambda_{TOTAL}}{data_{TOTAL}}\,\theta_{KERNEL}$$

Page 29: FusionSim Simulator


FUSION: WHICH BENCHMARK BENEFITS?

KERNEL THROUGHPUT

Comparison between two benchmarks:

– Rodinia BFS

Speed-up 4.28x

– Rodinia NN

Speed-up 1.05x

– Why?

100 times greater throughput for BFS

Kernel throughput: $\theta_{KERNEL}$

$$G_{GPU} = 1 + \frac{\lambda_{TOTAL}}{data_{TOTAL}}\,\theta_{KERNEL}$$

Page 30: FusionSim Simulator


FUSIONSIM WEBSITE:

DOCUMENTATION & SOURCE CODE

www.fusionsim.ca

– Discrete FusionSim & Fused FusionSim

– Source code

– Documentation

– Google group for collaborators

Page 31: FusionSim Simulator


DERIVATION OF THE ANALYTICAL SPEED-UP MODEL

SYMBOL MEANINGS
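Read as the symbols are used in Parts 1 and 2 below; $t'_{GPU}$ denotes the fused-system counterpart of $t_{GPU}$ in the notation adopted here:

– $\theta_{KER}$ (also written $\theta_{KERNEL}$): kernel data throughput

– $data_{KER}$: input data size per kernel invocation; $data_{TOTAL}$: total benchmark input data size, with $data_{KER} = data_{TOTAL}/n_{KER}$

– $n_{KER}$: number of kernel invocations (computation iterations)

– $\lambda_{KS}$: kernel spawn latency

– $\beta_{COPY}$: host/device copy bandwidth

– $\lambda_{TOT}$: latency overhead per iteration; $\lambda_{TOTAL} = n_{KER}\,\lambda_{TOT}$: total cumulative latency

– $t_{GPU}$, $t'_{GPU}$: time spent executing the CUDA code on the discrete and fused systems

– $G_{GPU}$, $G_{TOT}$: speed-up of the CUDA code and of the whole benchmark

– $\%_{CPU}$, $\%_{GPU}$: fractions of benchmark time spent in CPU and GPU code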

Page 32: FusionSim Simulator


DERIVATION OF THE ANALYTICAL SPEED-UP MODEL

PART 1

The kernel can be modeled as a channel of throughput $\theta_{KER}(data_{KER})$, since the actual throughput varies with $data_{KER}$. As $data_{KER}$ increases, $\theta_{KER}$ saturates.

Most existing CUDA applications (including all the considered Rodinia benchmarks) exhibit the following computation pattern:

For $n_{KER}$ iterations do:
1. Copy the input data from the host to the device
2. Launch the kernel on the data
3. Copy the results from the device to the host

For such applications $t_{GPU}$ is described by the following:

$$t_{GPU} = n_{KER}\left(\lambda_{TOT} + \frac{data_{KER}}{\theta_{KER}}\right)$$

where $\theta_{KER}$ is the kernel data throughput and $\lambda_{TOT}$ is the total latency per iteration resulting from both the memory transfers and the kernel spawn.

For CUDA applications that do not utilize multiple concurrent CUDA streams, the total latency per single computation iteration $\lambda_{TOT}$ comprises the time spent transferring the data to or from the device and the kernel spawn latency:

$$\lambda_{TOT} = \frac{2\,data_{KER}}{\beta_{COPY}} + \lambda_{KS}$$

The above expression holds true for all the considered Rodinia benchmarks.

Page 33: FusionSim Simulator


DERIVATION OF THE ANALYTICAL SPEED-UP MODEL

PART 2

Since on fused systems this latency reduces to $\lambda_{TOT} = \lambda_{KS}$, the time $t'_{GPU}$ of executing the CUDA code on the fused system is given by

$$t'_{GPU} = n_{KER}\left(\lambda_{KS} + \frac{data_{KER}}{\theta_{KER}}\right) \approx n_{KER}\,\frac{data_{KER}}{\theta_{KER}}$$

The speed-up of the CUDA code is given by

$$G_{GPU} = \frac{t_{GPU}}{t'_{GPU}} = 1 + \frac{\lambda_{TOT}\,\theta_{KER}}{data_{KER}}$$

Since $data_{KER} = data_{TOTAL}/n_{KER}$, we obtain

$$G_{GPU} = 1 + \frac{n_{KER}\,\lambda_{TOT}\,\theta_{KER}}{data_{TOTAL}} = 1 + \frac{\lambda_{TOTAL}}{data_{TOTAL}}\,\theta_{KER}$$

Here $\lambda_{TOTAL}$ is the total latency accumulated during the benchmark execution, comprising all the kernel spawn and memory transfer latencies.

Please also note that the throughput $\theta_{KER}$ of a benchmark kernel increases with $data_{KER}$ for small $data_{KER}$ values and saturates to a constant for large $data_{KER}$ values. The throughput saturates when the input data size is sufficient for the maximum possible warp scheduler occupancy for the given benchmark kernel. For benchmarks utilizing CUDA streams and overlapping kernel execution with data transfers, the latency is bounded from above, i.e.:

$$\lambda_{TOT} \le \frac{2\,data_{KER}}{\beta_{COPY}} + \lambda_{KS}$$

This results in a smaller speed-up $G_{GPU}$ for such benchmarks. Applying Amdahl's law, we get an expression for the total benchmark speed-up $G_{TOT}$:

$$G_{TOT} = \frac{1}{\%_{CPU} + \dfrac{\%_{GPU}}{G_{GPU}}}$$

Page 34: FusionSim Simulator


Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors.

The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. There is no obligation to update or otherwise correct or revise this information. However, we reserve the right to revise this information and to make changes from time to time to the content hereof without obligation to notify any person of such revisions or changes.

NO REPRESENTATIONS OR WARRANTIES ARE MADE WITH RESPECT TO THE CONTENTS HEREOF AND NO RESPONSIBILITY IS ASSUMED FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION.

ALL IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE ARE EXPRESSLY DISCLAIMED. IN NO EVENT WILL ANY LIABILITY TO ANY PERSON BE INCURRED FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

AMD, the AMD arrow logo, and combinations thereof are trademarks of Advanced Micro Devices, Inc. All other names used in this presentation are for informational purposes only and may be trademarks of their respective owners.

The contents of this presentation were provided by individual(s) and/or company listed on the title page. The information and opinions presented in this presentation may not represent AMD's positions, strategies or opinions. Unless explicitly stated, AMD is not responsible for the content herein and no endorsements are implied.