The Multi2Sim Simulation Framework Multi2Sim Simulation Framework, PACT 2011 Tutorial 24 7. The Evergreen GPU Emulation ... The Multi2Sim Simulation Framework, PACT 2011 Tutorial 26

Conference title 1

The Multi2Sim Simulation Framework

A CPU-GPU Modelfor Heterogeneous Computing

www.multi2sim.org

Rafael UbalDavid R. Kaeli

Northeastern UniversityBoston, MA

The Multi2Sim Simulation Framework, PACT 2011 Tutorial 2

Outline

1. Introduction

First Block – The x86 CPU Simulation

2. The x86 CPU Emulation3. The x86 CPU Architectural Simulation4. The Memory Hierarchy5. Benchmarks and Simulations

Second Block – The AMD Evergreen GPU Simulation

6. The OpenCL Programming Model7. The AMD Evergreen GPU Emulation8. The AMD Evergreen GPU Architectural Simulation9. Benchmarks and Simulations

10. Conclusions and Future work


1. IntroductionMotivation

• Limitations of existing CPU simulators– Such as SimpleScalar, Simics, SSMT, M-Sim, SMTSim,

M5, ...– Full-system vs. application-only simulation.– Free, open-source.– Architectural simulation accuracy.– Alpha/PISA architectures → cross-compilers.– Integrated system.

• Current simulation needs– Based on current processor market.– Heterogeneous CPU-GPU environments.– Tool for evaluation of new architectural proposals.– Simulation of a GPU ISA.

• Existing GPU simulation approaches– Barra: NVIDIA Telsa ISA.– Ocelot: PTX intermediate language simulator.– No architectural simulation.– No emulation of AMD ISAs.– Not capable of heterogeneous simulation.


1. IntroductionMulti2Sim Background

• Multi2Sim 3.x version series, 2011 (x86+Evergreen)

Superscalar pipelineOut-of-order execution, branch prediction, trace cache, etc.

MultithreadingFine-grain, coarse-grain and simultaneous (SMT).

Multicore architecture.Configurable memory hierarchy, cache coherence, interconnection networks.

State-of-the-art benchmarks. Tested support for common research benchmarks, available for download.

GPU modelSupport for OpenCL benchmarks.Model for Evergreen ISA.

• Multi2Sim 1.x version series, 2007 (MIPS-based)

• Multi2Sim 2.x version series, 2008 (x86-based)


1. IntroductionGetting Started

• User-friendly installation and test

$ tar -xzf multi2sim-3.1.tar.gz

$ cd multi2sim-3.1$ ./configure$ make$ sudo make install

• Application-only simulator

Original execution Simulated execution

$ ./test-args hola que tal arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal'

$ m2s ./test-args hola que tal <... Simulator output ...> arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal' <... Simulator statistics ...>


1. IntroductionThe IniFile Format

• Example of IniFile

; This is a comment.

[ Section 0 ]Color = RedHeight = 40

[ OtherSection ]Variable = Value

Demo 1

• Multi2Sim uses IniFile for– Configuration files.

– Output statistic files.

– Standard error output.


Block 1

The x86 CPUSimulation


2. The CPU EmulationDefinition

• Emulation (a.k.a. functional simulation)

– Just mimic original behavior of a program.

– … as opposed to timing/detailed/architectural simulation.

• Steps

1) Program loading.

2) Simulation loop.


2. The CPU EmulationProgram Loading

• Initialization of a process state– Virtual memory map.– Value of x86 registers.

StackProgram arguments

Environment variables

0x08000000

mmap region(not initialized)

HeapInitialized data

TextInitialized data

0x08xxxxxx

0x40000000

0xc0000000

eax

ebx

eax

ecx

esp

eip

Init

iali

zed

inst

ruct

ion

poin

ter

Top

of s

tack

1) Parse ELF executable– ELF sections.– Initialized code and data.

2) Initialize stack– Program headers.– Arguments.– Environment variables.

3) Initialize registers– Program entry point → eip– Stack pointer → esp


2. The CPU EmulationSimulation Loop

Demo 2

Read instr.at eip

Instr.bytes

Decodeinstruction

Instr.fields

Instr. isint 0x80

No Yes

Emulatesystem call

Emulatex86 instr.

Move eipto next instr.

• Emulation of x86 instructions– Update memory map (if needed).

– Update x86 registers.

– Example: add [bp+16], 0x5

• Emulation of Linux system calls– Analyze system call code and args.

– Update memory map.

– Update eax with return value.

– Example: read(fd, buf, count);


3. The CPU Architectural SimulationDefinition

• Architectural simulation (a.k.a. detailed/timing simulation)

– Provides performance results from executing a programon a configurable CPU model.

– Main performance metric: execution time.But also structures occupancy, cache hit rates, contention points...

ArchitecturalSimulator

cycle counterCPU

functionalsimulator

CPU coresmodel

Memory hierarchymodel

Run a new x86 instruction

This is the isntr.that was run


3. The CPU Architectural SimulationThe Superscalar Pipeline

Demo 3

Fetch

Instr.Cache

Fetch queue

Dispatch

· · ·Reorder Buffer

· · ·

· · ·Instruction Queue

· · ·Load/Store Queue Issue

Commit

DataCache

RegisterFile

FUTrace queue

· · ·

TraceCache

Decodeμop queue

· · ·

Writeback

• Characteristics– Speculative execution.

– Branch prediction.

– Out-of-order execution.

– Trace cache.


3. The CPU Architectural SimulationMultithreaded Processor Model

Fetch

Instr.Cache

Dispatch

· · ·· · ·

· · ·

· · ·Issue

Commit

DataCache

RegisterFile

FU· · ·

TraceCache

Decode · · ·

Writeback

Fetch

Instr.Cache

Dispatch

· · ·· · ·

· · ·

· · ·Issue

Commit

DataCache

RegisterFile

FU· · ·

TraceCache

Decode · · ·

Writeback

Fetch

Instr.Cache

Dispatch

· · ·· · ·

· · ·

· · ·Issue

Commit

DataCache

RegisterFile

FU· · ·

TraceCache

Decode · · ·

Writeback

Shared FunctionalUnit Pool

• Multithreading Paradigms

– Coarse grain multithreadingThread switch upon long-latency events.

– Fine grain multithreadingThread switch at a cycle granularity.

– Simultaneous multithreadingMultiple-thread issuing of instructions.


3. The CPU Architectural SimulationMulticore Processor Model

Core 0 Core 1

· · ·

Memory Hierarchy

Fetch

Instr.Cache

Dispatch

· · ·· · ·

· · ·

· · ·Issue

Commit

DataCache

RegisterFile

FU· · ·

TraceCache

Decode · · ·

Writeback

Fetch

Instr.Cache

Dispatch

· · ·· · ·

· · ·

· · ·Issue

Commit

DataCache

RegisterFile

FU· · ·

TraceCache

Decode · · ·

Writeback

• Multicore Processor– Multiple independent superscalar pipelines.

– Communication only through memory hierarchy.

Demo 4

• What can we run on it?– Multiple single-threaded programs.

– One (or more) programs spawning child threads.


3. The CPU Architectural SimulationDefinitions

• Core (c-0, c-1, ...)– Hardware component with an independent set of superscalar pipelines.

– Each core may contain several threads.

Demo 4

• Thread (t-0, t-1, ...)– Hardware component with a partially independent set of pipeline stages.

• Context (ctx-0, ctx-1, ...)– Software thread with independent value for registers (incl. eip).

– Can be a sequential program or a spawned child context.

• Node– Hardware component running a context.

– Multicore proc.: c0, c1, … Multithreaded proc.: t0, t1, …Multicore-multithreaded proc.: c0-t0, c0-t1, ...


• Configuring memory hierarchy– Any number of caches organized in any number of levels.

– Connected through any number of interconnects.

– A set of 1 or more caches must connect to an interconnect from “above”.Only one cache –or main memory– connected “below”.

4. Memory HierarchyConfiguration

• Memory hierarchy entries– Each node has two entries to the memory hierarchy:

Instruction entry + Data entry

– Several node entries can converge to the same cache (or main memory).

· · ·

Interconnect

Cache Cache Cache

Cache orMain Memory


4. Memory HierarchyConfiguration

c0-t0

DataL1

Instr.L1

c0-t1

DataL1

Instr.L1

Core 0

c1-t0

DataL1

Instr.L1

c1-t1

DataL1

Instr.L1

Core 1

L2 Cache L2 Cache

Main Memory

• Example– 2-core, 2-threaded processor (4 nodes).

– Each thread has its own private data and instruction L1 caches.

– L2 caches: shared among threads, private per core, unified for data/instr.

Demo 5


5. Benchmarks and SimulationsSupported CPU Benchmarks

• Sequential benchmarks– SPEC CPU 2000

– SPEC CPU 2006

– MediaBench-I

Demo 6

• Parallel benchmarks– SPLASH-2

– PARSEC 2.1

• Availability on website– x86 binaries tested on Multi2Sim.

– List of execution commands.

– Data files for free-distribution benchmarks.


Block 2

The AMD Evergreen GPUSimulation


6. The OpenCL Programming ModelIntroduction

• GPU– Massively parallel device.

– Originally devoted to graphics computations.

– Now getting popular for general purpose computations (GPGPU).

– Single-Program Multiple-Data (SIMP) model.

• Major GPU vendors– NVIDIA → CUDA programming language.

– AMD → OpenCL programming language.


6. The OpenCL Programming ModelVector Addition Example

int main(){

[ ... ]

clCreateProgramWithSource(...,"vector_add.cl", ...);

clCreateKernel(..., "vector_add",...);

buf1 = clCreateBuffer(..., CL_MEM_READ,size, ...);

buf2 = clCreateBuffer(..., CL_MEM_READ,size, ...);

buf3 = clCreateBuffer(..., CL_MEM_WRITE,size, ...);

clSetKernelArg(..., 0, buf1, ...);clSetKernelArg(..., 1, buf2, ...);clSetKernelArg(..., 2, buf3, ...);

clEnqueueNDRangeKernel(...);

[ ... ]}

OpenCL Host Programvector_add.c

OpenCL Device Kernelvector_add.cl

__kernel void vector_add(__read_only __global int *buf1,__read_only __global int *buf2,__write_only __global int *buf3)

{int id = get_global_id(0);

buf3[id] = buf1[id] + buf2[id];}

x86 executable binaryvector_add

AMD Evergreen kernel binaryvector_add.bin


6. The OpenCL Programming ModelOpenCL Software Entities

Common OpenCLKernel:

__kernel func(){

}

Work- group

... Work- item

Work- group

Work- group

Work- group

...

...

...

ND-Range

...

...

...

...

Work-group

Work- item

Work- item

Work- item

Work-item

Global memory Local memory Private memory

(Synchronizationallowed at this level)

• Properties– Host program configures ND-Range and Work-group sizes.

– Only Work-items in the same Work-group can synchronize and share data.

– Work-groups in ND-Range can execute in any order.


7. The Evergreen GPU EmulationThe OpenCL Call Stack

Opera

ting

syst

em

code

Use

r-sp

ace

code

OpenCL function call(e.g., clEnqueueNDRangeKernel)

OpenCL host program

AMD OpenCL library(libOpenCL.so)

System calls(mainly ioctl)

GPU Driver

Mult

i2Sim

Em

ula

ted

pro

gra

m

OpenCLfunction call

OpenCL host program

Multi2Sim OpenCL library(m2s-libpencl.so)

Special system call(code 325)

GPU Emulator

Native Execution Simulated Execution

• Comparison– OpenCL function calls are forwarded to m2s-libopencl.so.

– Each function is implemented as a system call 325.

– Multi2Sim emulates GPU after clEnqueueNDRangeKernel.


7. The Evergreen GPU EmulationProgram Loading

• Initialization of device kernel– Global memory map (whole ND-Range).

– Local memories (each work-group).

– Register files (each work-item).

Work-item Work-item · · ·

Work-group

Work-item Work-item · · ·

Work-group · · ·

ND-RangeGlobal

Memory

LocalMemories

RegisterFiles

OpenCL kernel binary(vector_add.bin)


7. The Evergreen GPU EmulationEvergreen Assembly Code

• Structure– Main Control Flow (CF) clause.

– Secondary Arithmetic-Logic (ALU) and Texture (TEX) clauses.

– ALU instructions are VLIW.

00 ALU: ADDR(32) CNT(8) KCACHE0(CB1:0-15) 0 x: LSHL R3.x, R0.x, 1 w: LSHL ____, R0.x, (0x3).x t: MOV R8.x, 1 1 x: LSHL R5.x, PV1.x, (0x2).x y: LSHR R1.y, PV1.z, (0x2).x z: ADD_INT ____, KC0[1].x, PV2.x t: LSHR R7.x, KC0[3].x, 1 2 y: LSHR R2.y, PV3.z, (0x2).x

01 TEX: ADDR(144) CNT(2) 3 VFETCH R1.x___, R1.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 4 VFETCH R2.x___, R2.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)

02 ALU_PUSH_BEFORE: ADDR(47) CNT(3) 5 x: LDS_WRITE ____, R1.w, R1.x 6 x: LDS_WRITE ____, R6.x, R2.x 7 x: PREDNE_INT ____, R7.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED

03 JUMP POP_CNT(1) ADDR(13)

04 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM

CF InstructionCounter

Secondary ClauseInstruction Counter

Secondary ALUClause

Secondary TEXClause


7. The Evergreen GPU EmulationSimulation Loop

Emulate(all work-items)

Emulate(all work-items)

Read CFinstruction

Instr.bytes

Decodeinstruction

Instr.fields

Instr. isCF?

YesNo

Start ALU/TEX clause

Read ALU/TEX instr.

Instr.bytes

Decodeinstruction

Instr.fields

End ofclause?

No

Go UpYes

• Execution of CF clause– Instructions affecting control flow.

– Synchronization operations.

– Writes to global memory.

• Secondary ALU clause– Arithmetic-logic operations.

– Accesses to local memory.

• Secondary TEX clause– Reads from global memory.

Demo 7


8. The GPU Architectural SimulationAMD Evergreen GPU Architecture

• The GPU Compute Device– Pool of pending work-groups (Wgs).

– Set of compute units (Cus).

– Dispatcher – maps WGs to CUs.

– Global memory hierarchy.ComputeUnit 0

ComputeUnit 1

ComputeUnit N-1· · ·

Work-group dispatcher

PendingWork-group

pool

Global Memory HierarchyALU

Engine

CFEngine

TEXEngine

Register File

ReadyWavefront

Pool Glo

bal M

emor

y(r

eads

)

Loca

lM

emor

y

ALU

Cla

use

TEX C

laus

e

Global m

emo ry

(writes)

• Compute Unit– Pool of pending wavefronts (Wfs)

– Three execution engines.

– Local memory.

– Register file.


8. The GPU Architectural SimulationExecution Engines

Fetch(one WF)

Instr. bytesbuffers

Decode(round-robin)

InstructionMemory

(CFClause)

From ReadyWavefront

Pool

Extr

act

WF

To ReadyWavefront

Pool

Inse

rtW

F

WF0

WF1

WFN-1

···

CF Instr.buffers(1 entryper WF)

WF0

WF1

WFN-1

···

Execute (round-robin)

Launch secondaryALU clause

Launch secondaryTEX clause

ExecuteCF instruction

Complete

• Control Flow (CF) Engine– 4 stages.

– Extracts one WF from pool at fetch stage.

– Places a WF back into pool at complete stage.

– Secondary clauses can be launched at execute stage.



···Instruction

bytes

DecodeRead(each

SubWF)

SubWF 0SubWF 1SubWF 2

xyzwt

...

...

Stream Core 0

Processing

Ele

me

ntsP

ipelin

eS

tages

Stream Core 1

Stream Core N–1

...

Work-Item 0SubWF 0, 1, ...

Write

Execute

InstructionMemory

(ALUclauses)

Loca

lM

emor

y LocalM

emory

FromRegister

File

ToRegister

File

xyzwt

VLIWbundlebuffer

(1 entry)

Work-Item 0SubWF 0, 1, ...

Work-Item N-1SubWF 0, 1, ...

Fetch(one WF)

• Arithmetic-Logic (ALU) Engine– 5 stages.– WF is split into SubWFs at the read stage.– SubWF size is equal to number of available Stream Cores (Scs).– Each SC has 5 pipelined processing elements (x, y, z, w, t).



...Fetch

(one WF)Instruction

bytes

Decode

InstructionMemory

(TEXClauses)

Read

Request toL1 cache

(Global Mem.)

Write

Data fromL1 cache

ToRegister

File

TEXinstr.buffer

(1 entry)

FromRegister

File

addr

.

data

• Control Flow (CF) Engine– 4 stages.

– Global memory reads are issued at read stage.

– They complete at write stage.


8. The GPU Architectural SimulationSummary of Work-items Grouping

• ND-Range– Group of all work-items for one kernel launch.

• Work-group– Work-items can perform synchronizations.

– Work-items share a fast-access local memory.

• Wavefront– SIMD execution unit.

• Subwavefront– Work-items that can be issued to Stream Cores at a time.

Op

en

CL P

rog

. Mod

el

GP

U A

rch

itectu

re

Demo 8


9. Benchmarks and SimulationsSupported GPU Benchmarks

• AMD SDK's OpenCL Benchmarks– Matrix computations.

– Financial benchmarks.

– Sorting algorithms.

– etc.

• Features– Provided in Multi2Sim site as x86 + Evergreen binaries.

– Command-line can be tuned for different input sizes.

– Provide both CPU and GPU implementations, with self-check.

Demo 9


10. ConclusionsSimulation Capabilities

• x86 CPU Simulation– ISA-level.

– No need for full-system simulation.

– Superscalar/multithreaded/multicore.

– Memory hierarchies and interconnects.

– State-of-the-art benchmarks.

• AMD Evergreen GPU Simulation– ISA-level.

– First full architectural simulation framework.

– Realistic GPU pipeline (based on AMD Radeon 5870).

– Memory hierarchies and interconnects.


10. ConclusionsAdditional Material

• The Multi2Sim Guide– Complete documentation.

– “Getting started” sections, with execution examples.

– Description of CPU and GPU architectural models.

• The Multi2Sim Forum– Discussion forum for Multi2Sim users.

• The Multi2Sim Mailing List– Announcements of new versions, updated documentation, etc.


10. ConclusionsFuture Work

• Extending support for benchmarks– Support for the entire OpenCL specification.

– Support for the entire Evergreen ISA.

– Support for the complete AMD SDK suite, and other upcoming benchmarks.

• Focus on heterogeneous architectures– Model for AMD Fusion.

– CPU and GPU working concurrently.

– Supporting/designing benchmarks with heterogeneous processing.

• Maintenance of CPU simulation– Issues reported by Multi2Sim users.

– Stability and support increases day by day.

Conference title 36

The Multi2Sim Simulation Framework

A CPU-GPU Modelfor Heterogeneous Computing

www.multi2sim.org

Rafael UbalDavid R. Kaeli

Northeastern UniversityBoston, MA

Documents

The Multi2Sim Simulation Framework Multi2Sim Simulation Framework, PACT 2011 Tutorial 24 7. The Evergreen GPU Emulation ... The Multi2Sim Simulation Framework, PACT 2011 Tutorial 26