26
SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY OCELOT: PTX EMULATOR 1

Ocelot: PTX Emulator

  • Upload
    millie

  • View
    58

  • Download
    0

Embed Size (px)

DESCRIPTION

Ocelot: PTX Emulator. Overview. Ocelot PTX Emulator Multicore-Backend NVIDIA GPU Backend AMD GPU Backend. Execution Model . NVIDIA’s PTX Execution Model. Array of multiprocessors each executing a CTA SIMD multiprocessors Single instruction multiple thread (SIMT) execution - PowerPoint PPT Presentation

Citation preview

Page 1: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 1

OCELOT: PTX EMULATOR

Page 2: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 2

OverviewOcelot PTX Emulator

Multicore-Backend

NVIDIA GPU Backend

AMD GPU Backend

Page 3: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Execution Model

Parallel thread execution (PTX) Explicit memory hierarchy Cooperative thread arrays (CTAs)

and ordering constraints

Array of multiprocessors each executing a CTA

SIMD multiprocessors Single instruction multiple thread (SIMT) execution Enables hardware to exploit

control flow uniformity and data locality

3

NVIDIA’s PTX Execution Model

Page 4: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

The Ocelot PTX EmulatorPerforms functional simulation of the PTX execution model

Access to complete machine stateEnables detailed performance evaluation/debugging

e.g. Program correctness checking Alignment checks, out of bounds etc.

Workload characterization and modeling

Trace generation to drive architecture simulators

Timing information not availablePTX 3.0 (Fermi) support

Abstract machine model

4

4

Page 5: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Emulator ImplementationSerialized execution of CTAs

Implements the CUDA execution model semantics CUDA does not guarantee concurrent CTA execution!

Implements abstract reconvergence mechanism Multiple reconvergence mechanisms are implemented:

IPDOM, Barrier, Thread frontiers

Software support for special functions: Texture sampling

1D, 2D, 3D, cube Nearest, linear

New instructions may be prototyped

Page 6: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Ocelot Source Code: PTX Emulator Device Backend• ocelot/

• executive/

• interface/EmulatorDevice.h• interface/EmulatedKernel.h• interface/EmulatorCallStack.h• interface/CooperativeThreadArray.h• interface/ReconvergenceMechanism.h• interface/TextureOperations.h

• trace/

• interface/TraceEvent.h• interface/TraceGenerator.h

6

Page 7: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Trace Generator Interface7

7

Emulator broadcasts events to event trace analyzers during execution

Events provide detailed device state: PC, activity mask, operand data, thread ID, etc.

Used for error checking, instrumentation, and simulation

Page 8: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Trace Generator Interface// ocelot/trace/interface/TraceGenerator.h

// Base class for generating traces

class TraceGenerator {public:TraceGenerator();

virtual ~TraceGenerator();

// Called when a traced kernel is launched to

// retrieve some parameters from the kernel

virtual void initialize( const executive::ExecutableKernel& kernel);

// Called whenever an event takes place.

virtual void event(const TraceEvent & event);

// Called when an event is committed

virtual void postEvent(const TraceEvent & event);

// Called when a kernel is finished. There will

// be no more events for this kernel.

virtual void finish();};

8

8

Page 9: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

TraceEvent ObjectCaptures execution of a dynamic instruction, including

Device pointer to access device state Kernel grid, CTA dimensions, parameter values PC and internal representation of PTX instruction Set of active threads executing instruction Memory addresses and size of transfer Branch target(s) and diverging threads

9

9

Page 10: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

trace::TraceEvent

class TraceEvent {public:

// ID of the block that generated the event ir::Dim3 blockId;

// PC index into EmulatedKernel's packed instruction sequence

ir::PTXU64 PC;

// Depth of call stack [i.e. number of contexts on the runtime stack]

ir::PTXU32 contextStackSize;

// Instruction pointer to instruction pointed to by PC const ir::PTXInstruction* instruction;

// Bit mask of active threads that executed this instruction

BitMask active; // Taken thread mask in case of a branch BitMask taken;

// Fall through thread mask in case of a branch BitMask fallthrough;

// Vector of memory addresses possibly generated for this instruction

U64Vector memory_addresses;

// Vector of sizes of memory operations possibly issued by this instruction

ir::PTXU32 memory_size;

// Dimensions of the kernel grid that generated the event

ir::Dim3 gridDim;

// Dimensions of the kernel block that generated the event

ir::Dim3 blockDim; // Captures just events related to thread reconvergence ReconvergenceTraceEvent reconvergence;

};

Page 11: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

PTX Emulator – CUDA Debugging – Race Detection

// file: raceCondition.cu__global__ void raceCondition(int *A) {

__shared__ int SharedMem[64];

SharedMem[threadIdx.x] = A[threadIdx.x];

// no synchronization barrier!

A[threadIdx.x] = SharedMem[64 - threadIdx.x]; // line 9 - faulting load}

. . .

raceCondition<<< dim3(1,1), dim3(64, 1) >>>( validPtr );

. . .

==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z13raceConditionPi" with exception:

==Ocelot== [PC 15] [thread 0] [cta 0] ld.shared.s32 %r14, [%r13 + 252]

- Shared memory race condition, address 0xfc was previously written by thread 63 without a memory barrier in between.

==Ocelot== Near raceCondition.cu:9:0

11

11

Page 12: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

PTX Emulator – CUDA Debugging- Illegal Memory Accesses

// file: memoryCheck.cu__global__ void badMemoryReference(int *A) {

A[threadIdx.x] = 0; // line 3 - faulting store}

int main() {

int *invalidPtr = 0x0234; // arbitrary pointer does not refer // to an existing memory allocation

int *validPtr = 0; cudaMalloc((void **)&validPtr, sizeof(int)*64);

badMemoryReference<<< dim3(1,1), dim3(64, 1) >>>( invalidPtr ); return 0;

} ==Ocelot== Ocelot PTX Emulator failed to run kernel "_Z18badMemoryReferencePi" with exception: ==Ocelot== [PC 5] [thread 0] [cta 0] st.global.s32 [%r4 + 0], %r0 - Global memory access 0x234 is not within any allocated or mapped range.==Ocelot== ==Ocelot== Nearby Device Allocations==Ocelot== [0x12fa2e0] - [0x12fa3e0] (256 bytes)==Ocelot== ==Ocelot== Near memoryCheck.cu:3:0

12

12

Page 13: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Interactive DebuggerInteractive command-line debugger implemented using TraceGenerator interface

$ ./TestCudaSequenceA_gpu = 0x16dcbe0(ocelot-dbg) Attaching debugger to kernel 'sequence'(ocelot-dbg)(ocelot-dbg) watch global address 0x16dcbe4 s32[3]set #1: watch global address 0x16dcbe4 s32[3] - 12 bytes(ocelot-dbg)(ocelot-dbg) continuest.global.s32 [%r11 + 0], %r7watchpoint #1 - CTA (0, 0) thread (1, 0, 0) - store to 0x16dcbe4 4 bytes old value = -1 new value = 2 thread (2, 0, 0) - store to 0x16dcbe8 4 bytes old value = -1 new value = 4 thread (3, 0, 0) - store to 0x16dcbec 4 bytes old value = -1 new value = 6break on watchpoint(ocelot-dbg)

• Enables • Inspection of application

state • Single-stepping of

instructions• Breakpoints and

watchpoints

• Faults in MemoryChecker and RaceDetector invoke ocelot-dbg automatically

Page 14: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Performance Tuning

....for (int offset = 1; offset < n; offset *= 2){ // line 61

pout = 1 - pout;pin = 1 - pout;__syncthreads();temp[pout*n+thid] = temp[pin*n+thid];

if (thid >= offset) { temp[pout*n+thid] += temp[pin*n+thid-offset];

}}....

• Identify critical regions• Memory demand• Floating-point intensity• Shared memory bank

conflicts

Page 15: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Example: Inter-thread Data Flow

• Which kernels exchange computed results through shared memory

• Track id of producer thread

• Ensure threads are well synchronized

15

15

Page 16: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Current Trace Generators16

Trace Generators FunctionBranch Measures control flow uniformity and

branch divergenceInstruction Static and dynamic instruction countIntegrated Debugger GDB-like interfaceKernel Dimension Kernel grid and block dimensionsMachine Attributes Observe and record machine characteristics

Memory Working set size, memory intensity, memory efficiency

Memory Checker i) Bounds checks, ii) alignment checksiii) uninitialized loads (shared memory)

Memory Race Detector Race conditions on shared memoryParallelism MIMD and SIMD parallelism limitsPerformance Bound Compute and memory throughputShared Computation Extent of data flow among threads

Warp Synchronous Hot-paths/regions for warp synchronous execution

16

Page 17: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Using Trace Generators

Implement TraceGenerator interface

override methods: initialize( ), event( ), postEvent( ), finish( )

Add to Ocelot runtime explicitly: ocelot::addTraceGenerator( )

or, add to trace::TraceConfiguration

Online analysis or serialize event traces

Link applications with libocelotTrace.so

17

17

Page 18: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Configuring & Executing Applications Edit configure.ocelot

Controls Ocelot’s initial state Located in application’s startup directory trace specifies which trace generators are initially

attached executive controls device properties

trace: memoryChecker – ensures accesses to a memory

region associated with the currently selected device

raceDetector - enforces synchronized access to .shared

debugger - interactive debugger executive:

devices: emulated – Ocelot PTX emulator (trace generators)

trace: { memoryChecker: { enabled: true, checkInitialization: false }, raceDetector: { enabled: true, ignoreIrrelevantWrites: true }, debugger: { enabled: true, kernelFilter:

"_Z13scalarProdGPUPfS_S_ii", alwaysAttach: false }, }, executive: { devices: [ "emulated" ], }}

18

Page 19: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY

Example: Thread Load Imbalance

//! Computes number of dynamic instructions for each thread

class ThreadLoadImbalance: public trace::TraceGenerator {

public:

std::vector< size_t > dynamicInstructions;

// For each dynamic instruction, increment counters of each

// thread that executes it

virtual void event(const TraceEvent & event) {

if (!dynamicInstructions.size())

dynamicInstructions.resize(event.active.size(), 0);

for (int i = 0; i < event.active.size(); i++) {

if (event.active[i])

dynamicInstructions[i]++;

}

}

};

• Mandelbrot (CUDA SDK)

Dyna

mic

Inst

ruct

ion

Coun

t

19

19

Page 20: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 20

EXAMPLE: CONTROL-FLOW DIVERGENCE

Page 21: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 21

Control-Flow Divergence in PTX Emulator PTX Emulator facilitates customizable handlers for

control flow divergence

Currently implements: Immediate post-dominator (ipdom) And many more

Abstract handlers implement potentially divergent control instructions

e.g. Bra, Bar, Exit, Vote

Executing instructions drives TraceGenerators Reconvergence affects active threads, dynamic instruction

count, and instruction trace Analysis tools can group threads into warps of arbitrary size

Page 22: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 22

class ReconvergenceMechanism// ocelot/executive/interface/ReconvergenceMechanism.h

class ReconvergenceMechanism {public: ReconvergenceMechanism(CooperativeThreadArray *cta); virtual ~ReconvergenceMechanism();

virtual void initialize() = 0;

virtual void evalPredicate(CTAContext &context) = 0;

virtual bool eval_Bra(CTAContext &context, const ir::PTXInstruction &instr, const boost::dynamic_bitset<> & branch, const boost::dynamic_bitset<> & fallthrough) = 0;

virtual void eval_Bar(CTAContext &context, const ir::PTXInstruction &instr) = 0;

virtual void eval_Reconverge(CTAContext &context, const ir::PTXInstruction &instr) = 0;

virtual void eval_Exit(CTAContext &context, const ir::PTXInstruction &instr) = 0;

virtual void eval_Vote(CTAContext &context, const ir::PTXInstruction &instr);

virtual bool nextInstruction(CTAContext &context, const ir::PTXInstruction &instr, const ir::PTXInstruction::Opcode &) = 0; virtual CTAContext& getContext() = 0;}

Page 23: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 23

Example: Immediate Post-Dominator Reconvergence// ocelot/executive/interface/ReconvergenceMechanism.h

class ReconvergenceIPDOM: public ReconvergenceMechanism {public: ReconvergenceIPDOM(CooperativeThreadArray *cta); ~ReconvergenceIPDOM(); void initialize(); void evalPredicate(CTAContext &context); bool eval_Bra(CTAContext &context, PTXInstruction &instr, dynamic_bitset<> & branch, dynamic_bitset<> & fallthrough);

void eval_Bar(CTAContext &context, PTXInstruction &instr);

void eval_Reconverge(CTAContext &context, PTXInstruction &instr);

void eval_Exit(CTAContext &context, PTXInstruction &instr);

bool nextInstruction(CTAContext &context, PTXInstruction &instr, PTXInstruction::Opcode &);

CTAContext& getContext(); size_t stackSize() const; void push(CTAContext&); void pop(); std::vector<CTAContext> runtimeStack; std::vector<int> pcStack; unsigned int reconvergeEvents;};

Page 24: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 24

Example: Immediate Post-Dominator Reconvergence

// ocelot/executive/implementation/ReconvergenceMechanism.cpp

void executive::ReconvergenceIPDOM::eval_Bar(executive::CTAContext &context, const ir::PTXInstruction &instr) { if (context.active.count() < context.active.size()) { // deadlock - not all threads reach synchronization barrier std::stringstream message; message << "barrier deadlock:\n"; throw RuntimeException(message.str(), context.PC, instr); }}

Page 25: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 25

Example: Immediate Post-Dominator Reconvergencevoid executive::ReconvergenceIPDOM::eval_Reconverge( executive::CTAContext &context, const ir::PTXInstruction &instr) { if(runtimeStack.size() > 1) { if(pcStack.back() == context.PC) { pcStack.pop_back(); runtimeStack.pop_back(); ++reconvergeEvents; } else { context.PC++; } } else { context.PC++; }}void executive::ReconvergenceIPDOM::eval_Exit(executive::CTAContext &context, const ir::PTXInstruction &instr) { eval_Bar(context, instr); context.running = false;}

Page 26: Ocelot: PTX Emulator

SCHOOL OF ELECTRICAL AND COMPUTER ENGINEERING | GEORGIA INSTITUTE OF TECHNOLOGY 26

Summary: Ocelot PTX Emulator Used for instruction level analysis

Can be attached to trace generators and trace analyzers

Simple processing and filtering of machine state

Forms the basis for a range of productivity tools Correctness tools Debugging tools Workload characterization

Drives instruction and address traces to MacSim