Upload
phungkhanh
View
221
Download
2
Embed Size (px)
Citation preview
Conference title 1
The Multi2Sim Simulation Framework
A CPU-GPU Modelfor Heterogeneous Computing
www.multi2sim.org
Rafael UbalDavid R. Kaeli
Northeastern UniversityBoston, MA
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 2
Outline
1. Introduction
First Block – The x86 CPU Simulation
2. The x86 CPU Emulation3. The x86 CPU Architectural Simulation4. The Memory Hierarchy5. Benchmarks and Simulations
Second Block – The AMD Evergreen GPU Simulation
6. The OpenCL Programming Model7. The AMD Evergreen GPU Emulation8. The AMD Evergreen GPU Architectural Simulation9. Benchmarks and Simulations
10. Conclusions and Future work
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 3
1. IntroductionMotivation
• Limitations of existing CPU simulators– Such as SimpleScalar, Simics, SSMT, M-Sim, SMTSim,
M5, ...– Full-system vs. application-only simulation.– Free, open-source.– Architectural simulation accuracy.– Alpha/PISA architectures → cross-compilers.– Integrated system.
• Current simulation needs– Based on current processor market.– Heterogeneous CPU-GPU environments.– Tool for evaluation of new architectural proposals.– Simulation of a GPU ISA.
• Existing GPU simulation approaches– Barra: NVIDIA Telsa ISA.– Ocelot: PTX intermediate language simulator.– No architectural simulation.– No emulation of AMD ISAs.– Not capable of heterogeneous simulation.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 4
1. IntroductionMulti2Sim Background
• Multi2Sim 3.x version series, 2011 (x86+Evergreen)
Superscalar pipelineOut-of-order execution, branch prediction, trace cache, etc.
MultithreadingFine-grain, coarse-grain and simultaneous (SMT).
Multicore architecture.Configurable memory hierarchy, cache coherence, interconnection networks.
State-of-the-art benchmarks. Tested support for common research benchmarks, available for download.
GPU modelSupport for OpenCL benchmarks.Model for Evergreen ISA.
• Multi2Sim 1.x version series, 2007 (MIPS-based)
• Multi2Sim 2.x version series, 2008 (x86-based)
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 5
1. IntroductionGetting Started
• User-friendly installation and test
$ tar -xzf multi2sim-3.1.tar.gz
$ cd multi2sim-3.1$ ./configure$ make$ sudo make install
• Application-only simulator
Original execution Simulated execution
$ ./test-args hola que tal arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal'
$ m2s ./test-args hola que tal <... Simulator output ...> arg[0] = 'hola' arg[1] = 'que' arg[2] = 'tal' <... Simulator statistics ...>
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 6
1. IntroductionThe IniFile Format
• Example of IniFile
; This is a comment.
[ Section 0 ]Color = RedHeight = 40
[ OtherSection ]Variable = Value
Demo 1
• Multi2Sim uses IniFile for– Configuration files.
– Output statistic files.
– Standard error output.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 7
Block 1
The x86 CPUSimulation
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 8
2. The CPU EmulationDefinition
• Emulation (a.k.a. functional simulation)
– Just mimic original behavior of a program.
– … as opposed to timing/detailed/architectural simulation.
• Steps
1) Program loading.
2) Simulation loop.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 9
2. The CPU EmulationProgram Loading
• Initialization of a process state– Virtual memory map.– Value of x86 registers.
StackProgram arguments
Environment variables
0x08000000
mmap region(not initialized)
HeapInitialized data
TextInitialized data
0x08xxxxxx
0x40000000
0xc0000000
eax
ebx
eax
ecx
esp
eip
Init
iali
zed
inst
ruct
ion
poin
ter
Top
of s
tack
1) Parse ELF executable– ELF sections.– Initialized code and data.
2) Initialize stack– Program headers.– Arguments.– Environment variables.
3) Initialize registers– Program entry point → eip– Stack pointer → esp
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 10
2. The CPU EmulationSimulation Loop
Demo 2
Read instr.at eip
Instr.bytes
Decodeinstruction
Instr.fields
Instr. isint 0x80
No Yes
Emulatesystem call
Emulatex86 instr.
Move eipto next instr.
• Emulation of x86 instructions– Update memory map (if needed).
– Update x86 registers.
– Example: add [bp+16], 0x5
• Emulation of Linux system calls– Analyze system call code and args.
– Update memory map.
– Update eax with return value.
– Example: read(fd, buf, count);
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 11
3. The CPU Architectural SimulationDefinition
• Architectural simulation (a.k.a. detailed/timing simulation)
– Provides performance results from executing a programon a configurable CPU model.
– Main performance metric: execution time.But also structures occupancy, cache hit rates, contention points...
ArchitecturalSimulator
cycle counterCPU
functionalsimulator
CPU coresmodel
Memory hierarchymodel
Run a new x86 instruction
This is the isntr.that was run
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 12
3. The CPU Architectural SimulationThe Superscalar Pipeline
Demo 3
Fetch
Instr.Cache
Fetch queue
Dispatch
· · ·Reorder Buffer
· · ·
· · ·Instruction Queue
· · ·Load/Store Queue Issue
Commit
DataCache
RegisterFile
FUTrace queue
· · ·
TraceCache
Decodeμop queue
· · ·
Writeback
• Characteristics– Speculative execution.
– Branch prediction.
– Out-of-order execution.
– Trace cache.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 13
3. The CPU Architectural SimulationMultithreaded Processor Model
Fetch
Instr.Cache
Dispatch
· · ·· · ·
· · ·
· · ·Issue
Commit
DataCache
RegisterFile
FU· · ·
TraceCache
Decode · · ·
Writeback
Fetch
Instr.Cache
Dispatch
· · ·· · ·
· · ·
· · ·Issue
Commit
DataCache
RegisterFile
FU· · ·
TraceCache
Decode · · ·
Writeback
Fetch
Instr.Cache
Dispatch
· · ·· · ·
· · ·
· · ·Issue
Commit
DataCache
RegisterFile
FU· · ·
TraceCache
Decode · · ·
Writeback
Shared FunctionalUnit Pool
• Multithreading Paradigms
– Coarse grain multithreadingThread switch upon long-latency events.
– Fine grain multithreadingThread switch at a cycle granularity.
– Simultaneous multithreadingMultiple-thread issuing of instructions.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 14
3. The CPU Architectural SimulationMulticore Processor Model
Core 0 Core 1
· · ·
Memory Hierarchy
Fetch
Instr.Cache
Dispatch
· · ·· · ·
· · ·
· · ·Issue
Commit
DataCache
RegisterFile
FU· · ·
TraceCache
Decode · · ·
Writeback
Fetch
Instr.Cache
Dispatch
· · ·· · ·
· · ·
· · ·Issue
Commit
DataCache
RegisterFile
FU· · ·
TraceCache
Decode · · ·
Writeback
• Multicore Processor– Multiple independent superscalar pipelines.
– Communication only through memory hierarchy.
Demo 4
• What can we run on it?– Multiple single-threaded programs.
– One (or more) programs spawning child threads.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 15
3. The CPU Architectural SimulationDefinitions
• Core (c-0, c-1, ...)– Hardware component with an independent set of superscalar pipelines.
– Each core may contain several threads.
Demo 4
• Thread (t-0, t-1, ...)– Hardware component with a partially independent set of pipeline stages.
• Context (ctx-0, ctx-1, ...)– Software thread with independent value for registers (incl. eip).
– Can be a sequential program or a spawned child context.
• Node– Hardware component running a context.
– Multicore proc.: c0, c1, … Multithreaded proc.: t0, t1, …Multicore-multithreaded proc.: c0-t0, c0-t1, ...
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 16
• Configuring memory hierarchy– Any number of caches organized in any number of levels.
– Connected through any number of interconnects.
– A set of 1 or more caches must connect to an interconnect from “above”.Only one cache –or main memory– connected “below”.
4. Memory HierarchyConfiguration
• Memory hierarchy entries– Each node has two entries to the memory hierarchy:
Instruction entry + Data entry
– Several node entries can converge to the same cache (or main memory).
· · ·
Interconnect
Cache Cache Cache
Cache orMain Memory
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 17
4. Memory HierarchyConfiguration
c0-t0
DataL1
Instr.L1
c0-t1
DataL1
Instr.L1
Core 0
c1-t0
DataL1
Instr.L1
c1-t1
DataL1
Instr.L1
Core 1
L2 Cache L2 Cache
Main Memory
• Example– 2-core, 2-threaded processor (4 nodes).
– Each thread has its own private data and instruction L1 caches.
– L2 caches: shared among threads, private per core, unified for data/instr.
Demo 5
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 18
5. Benchmarks and SimulationsSupported CPU Benchmarks
• Sequential benchmarks– SPEC CPU 2000
– SPEC CPU 2006
– MediaBench-I
Demo 6
• Parallel benchmarks– SPLASH-2
– PARSEC 2.1
• Availability on website– x86 binaries tested on Multi2Sim.
– List of execution commands.
– Data files for free-distribution benchmarks.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 19
Block 2
The AMD Evergreen GPUSimulation
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 20
6. The OpenCL Programming ModelIntroduction
• GPU– Massively parallel device.
– Originally devoted to graphics computations.
– Now getting popular for general purpose computations (GPGPU).
– Single-Program Multiple-Data (SIMP) model.
• Major GPU vendors– NVIDIA → CUDA programming language.
– AMD → OpenCL programming language.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 21
6. The OpenCL Programming ModelVector Addition Example
int main(){
[ ... ]
clCreateProgramWithSource(...,"vector_add.cl", ...);
clCreateKernel(..., "vector_add",...);
buf1 = clCreateBuffer(..., CL_MEM_READ,size, ...);
buf2 = clCreateBuffer(..., CL_MEM_READ,size, ...);
buf3 = clCreateBuffer(..., CL_MEM_WRITE,size, ...);
clSetKernelArg(..., 0, buf1, ...);clSetKernelArg(..., 1, buf2, ...);clSetKernelArg(..., 2, buf3, ...);
clEnqueueNDRangeKernel(...);
[ ... ]}
OpenCL Host Programvector_add.c
OpenCL Device Kernelvector_add.cl
__kernel void vector_add(__read_only __global int *buf1,__read_only __global int *buf2,__write_only __global int *buf3)
{int id = get_global_id(0);
buf3[id] = buf1[id] + buf2[id];}
x86 executable binaryvector_add
AMD Evergreen kernel binaryvector_add.bin
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 22
6. The OpenCL Programming ModelOpenCL Software Entities
Common OpenCLKernel:
__kernel func(){
}
Work- group
... Work- item
Work- group
Work- group
Work- group
...
...
...
ND-Range
...
...
...
...
Work-group
Work- item
Work- item
Work- item
Work-item
Global memory Local memory Private memory
(Synchronizationallowed at this level)
• Properties– Host program configures ND-Range and Work-group sizes.
– Only Work-items in the same Work-group can synchronize and share data.
– Work-groups in ND-Range can execute in any order.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 23
7. The Evergreen GPU EmulationThe OpenCL Call Stack
Opera
ting
syst
em
code
Use
r-sp
ace
code
OpenCL function call(e.g., clEnqueueNDRangeKernel)
OpenCL host program
AMD OpenCL library(libOpenCL.so)
System calls(mainly ioctl)
GPU Driver
Mult
i2Sim
Em
ula
ted
pro
gra
m
OpenCLfunction call
OpenCL host program
Multi2Sim OpenCL library(m2s-libpencl.so)
Special system call(code 325)
GPU Emulator
Native Execution Simulated Execution
• Comparison– OpenCL function calls are forwarded to m2s-libopencl.so.
– Each function is implemented as a system call 325.
– Multi2Sim emulates GPU after clEnqueueNDRangeKernel.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 24
7. The Evergreen GPU EmulationProgram Loading
• Initialization of device kernel– Global memory map (whole ND-Range).
– Local memories (each work-group).
– Register files (each work-item).
Work-item Work-item · · ·
Work-group
Work-item Work-item · · ·
Work-group · · ·
ND-RangeGlobal
Memory
LocalMemories
RegisterFiles
OpenCL kernel binary(vector_add.bin)
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 25
7. The Evergreen GPU EmulationEvergreen Assembly Code
• Structure– Main Control Flow (CF) clause.
– Secondary Arithmetic-Logic (ALU) and Texture (TEX) clauses.
– ALU instructions are VLIW.
00 ALU: ADDR(32) CNT(8) KCACHE0(CB1:0-15) 0 x: LSHL R3.x, R0.x, 1 w: LSHL ____, R0.x, (0x3).x t: MOV R8.x, 1 1 x: LSHL R5.x, PV1.x, (0x2).x y: LSHR R1.y, PV1.z, (0x2).x z: ADD_INT ____, KC0[1].x, PV2.x t: LSHR R7.x, KC0[3].x, 1 2 y: LSHR R2.y, PV3.z, (0x2).x
01 TEX: ADDR(144) CNT(2) 3 VFETCH R1.x___, R1.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET) 4 VFETCH R2.x___, R2.y, fc156 MEGA(4) FETCH_TYPE(NO_INDEX_OFFSET)
02 ALU_PUSH_BEFORE: ADDR(47) CNT(3) 5 x: LDS_WRITE ____, R1.w, R1.x 6 x: LDS_WRITE ____, R6.x, R2.x 7 x: PREDNE_INT ____, R7.x, 0.0f UPDATE_EXEC_MASK UPDATE_PRED
03 JUMP POP_CNT(1) ADDR(13)
04 MEM_RAT_CACHELESS_STORE_RAW: RAT(1)[R1].x___, R0, ARRAY_SIZE(4) MARK VPM
CF InstructionCounter
Secondary ClauseInstruction Counter
Secondary ALUClause
Secondary TEXClause
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 26
7. The Evergreen GPU EmulationSimulation Loop
Emulate(all work-items)
Emulate(all work-items)
Read CFinstruction
Instr.bytes
Decodeinstruction
Instr.fields
Instr. isCF?
YesNo
Start ALU/TEX clause
Read ALU/TEX instr.
Instr.bytes
Decodeinstruction
Instr.fields
End ofclause?
No
Go UpYes
• Execution of CF clause– Instructions affecting control flow.
– Synchronization operations.
– Writes to global memory.
• Secondary ALU clause– Arithmetic-logic operations.
– Accesses to local memory.
• Secondary TEX clause– Reads from global memory.
Demo 7
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 27
8. The GPU Architectural SimulationAMD Evergreen GPU Architecture
• The GPU Compute Device– Pool of pending work-groups (Wgs).
– Set of compute units (Cus).
– Dispatcher – maps WGs to CUs.
– Global memory hierarchy.ComputeUnit 0
ComputeUnit 1
ComputeUnit N-1· · ·
Work-group dispatcher
PendingWork-group
pool
Global Memory HierarchyALU
Engine
CFEngine
TEXEngine
Register File
ReadyWavefront
Pool Glo
bal M
emor
y(r
eads
)
Loca
lM
emor
y
ALU
Cla
use
TEX C
laus
e
Global m
emo ry
(writes)
• Compute Unit– Pool of pending wavefronts (Wfs)
– Three execution engines.
– Local memory.
– Register file.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 28
8. The GPU Architectural SimulationExecution Engines
Fetch(one WF)
Instr. bytesbuffers
Decode(round-robin)
InstructionMemory
(CFClause)
From ReadyWavefront
Pool
Extr
act
WF
To ReadyWavefront
Pool
Inse
rtW
F
WF0
WF1
WFN-1
···
CF Instr.buffers(1 entryper WF)
WF0
WF1
WFN-1
···
Execute (round-robin)
Launch secondaryALU clause
Launch secondaryTEX clause
ExecuteCF instruction
Complete
• Control Flow (CF) Engine– 4 stages.
– Extracts one WF from pool at fetch stage.
– Places a WF back into pool at complete stage.
– Secondary clauses can be launched at execute stage.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 29
8. The GPU Architectural SimulationExecution Engines
···Instruction
bytes
DecodeRead(each
SubWF)
SubWF 0SubWF 1SubWF 2
xyzwt
...
...
Stream Core 0
Processing
Ele
me
ntsP
ipelin
eS
tages
Stream Core 1
Stream Core N–1
...
Work-Item 0SubWF 0, 1, ...
Write
Execute
InstructionMemory
(ALUclauses)
Loca
lM
emor
y LocalM
emory
FromRegister
File
ToRegister
File
xyzwt
VLIWbundlebuffer
(1 entry)
Work-Item 0SubWF 0, 1, ...
Work-Item N-1SubWF 0, 1, ...
Fetch(one WF)
• Arithmetic-Logic (ALU) Engine– 5 stages.– WF is split into SubWFs at the read stage.– SubWF size is equal to number of available Stream Cores (Scs).– Each SC has 5 pipelined processing elements (x, y, z, w, t).
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 30
8. The GPU Architectural SimulationExecution Engines
...Fetch
(one WF)Instruction
bytes
Decode
InstructionMemory
(TEXClauses)
Read
Request toL1 cache
(Global Mem.)
Write
Data fromL1 cache
ToRegister
File
TEXinstr.buffer
(1 entry)
FromRegister
File
addr
.
data
• Control Flow (CF) Engine– 4 stages.
– Global memory reads are issued at read stage.
– They complete at write stage.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 31
8. The GPU Architectural SimulationSummary of Work-items Grouping
• ND-Range– Group of all work-items for one kernel launch.
• Work-group– Work-items can perform synchronizations.
– Work-items share a fast-access local memory.
• Wavefront– SIMD execution unit.
• Subwavefront– Work-items that can be issued to Stream Cores at a time.
Op
en
CL P
rog
. Mod
el
GP
U A
rch
itectu
re
Demo 8
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 32
9. Benchmarks and SimulationsSupported GPU Benchmarks
• AMD SDK's OpenCL Benchmarks– Matrix computations.
– Financial benchmarks.
– Sorting algorithms.
– etc.
• Features– Provided in Multi2Sim site as x86 + Evergreen binaries.
– Command-line can be tuned for different input sizes.
– Provide both CPU and GPU implementations, with self-check.
Demo 9
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 33
10. ConclusionsSimulation Capabilities
• x86 CPU Simulation– ISA-level.
– No need for full-system simulation.
– Superscalar/multithreaded/multicore.
– Memory hierarchies and interconnects.
– State-of-the-art benchmarks.
• AMD Evergreen GPU Simulation– ISA-level.
– First full architectural simulation framework.
– Realistic GPU pipeline (based on AMD Radeon 5870).
– Memory hierarchies and interconnects.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 34
10. ConclusionsAdditional Material
• The Multi2Sim Guide– Complete documentation.
– “Getting started” sections, with execution examples.
– Description of CPU and GPU architectural models.
• The Multi2Sim Forum– Discussion forum for Multi2Sim users.
• The Multi2Sim Mailing List– Announcements of new versions, updated documentation, etc.
The Multi2Sim Simulation Framework, PACT 2011 Tutorial 35
10. ConclusionsFuture Work
• Extending support for benchmarks– Support for the entire OpenCL specification.
– Support for the entire Evergreen ISA.
– Support for the complete AMD SDK suite, and other upcoming benchmarks.
• Focus on heterogeneous architectures– Model for AMD Fusion.
– CPU and GPU working concurrently.
– Supporting/designing benchmarks with heterogeneous processing.
• Maintenance of CPU simulation– Issues reported by Multi2Sim users.
– Stability and support increases day by day.
Conference title 36
The Multi2Sim Simulation Framework
A CPU-GPU Modelfor Heterogeneous Computing
www.multi2sim.org
Rafael UbalDavid R. Kaeli
Northeastern UniversityBoston, MA