Towards Acceleration of Fault Simulation Using
Graphics Processing Units
Kanupriya Gulati Sunil P. Khatri
Department of ECE, Texas A&M University, College Station
Outline
Introduction
Technical Specifications of the GPU
CUDA Programming Model
Approach
Experimental Setup and Results
Conclusions
Introduction

Fault Simulation (FS) is crucial in the VLSI design flow
  Given a digital design and a set of vectors V, FS evaluates the number of stuck-at faults (Fsim) tested by applying V
  The ratio Fsim/Ftotal is a measure of fault coverage
Current designs have millions of logic gates
  The number of faulty variations is proportional to design size
  Each of these variations needs to be simulated for the V vectors
Therefore, it is important to explore ways to accelerate FS
The ideal FS approach should be
  Fast
  Scalable
  Cost effective
Introduction

We accelerate FS using graphics processing units (GPUs)
  By exploiting fault- and pattern-parallel approaches
A GPU is essentially a commodity stream processor
  Highly parallel
  Very fast
  Operating paradigm is SIMD (Single Instruction, Multiple Data)
GPUs, owing to their massively parallel architecture, have been used to accelerate
  Image/stream processing
  Data compression
  Numerical algorithms
    LU decomposition, FFT, etc.
Introduction

We implemented our approach on the NVIDIA GeForce 8800 GTX GPU
  By careful engineering, we maximally harness the GPU's
    Raw computational power and
    Huge memory bandwidth
We used the Compute Unified Device Architecture (CUDA) framework
  A freely available, C-like GPU programming and interfacing tool
When using a single 8800 GTX GPU card
  ~35X speedup is obtained compared to a commercial FS tool
  Accounts for CPU processing and data transfer times as well
Our runtimes are also projected for the NVIDIA Tesla server
  Can house up to 8 GPU devices
  ~238X speedup is possible compared to the commercial engine
Outline
Introduction
Technical Specifications of the GPU
CUDA Programming Model
Approach
Experimental Setup and Results
Conclusions
GPU – A Massively Parallel Processor
[Figure: GPU architecture block diagram. Source: "NVIDIA CUDA Programming Guide", version 1.1]
GeForce 8800 GTX Technical Specs.

367 GFLOPS peak performance for certain applications
  25-50 times that of current high-end microprocessors
Up to 265 GFLOPS sustained performance
Massively parallel: 128 SIMD processor cores
  Partitioned into 16 multiprocessors (MPs)
Massively threaded: sustains 1000s of threads per application
768 MB device memory
1.4 GHz clock frequency
  CPU at ~4 GHz
86.4 GB/sec memory bandwidth
  CPU at 8 GB/sec front side bus
A 1U Tesla server from NVIDIA can house up to 8 GPUs
Outline
Introduction
Technical Specifications of the GPU
CUDA Programming Model
Approach
Experimental Setup and Results
Conclusions
CUDA Programming Model

The GPU is viewed as a compute device that:
  Is a coprocessor to the CPU, or host
  Has its own DRAM (device memory)
  Runs many threads in parallel

[Figure: the host (CPU) is connected over PCIe to the device (GPU), which holds the device memory and runs kernel threads (instances of the kernel)]
CUDA Programming Model

Data-parallel portions of an application are executed on the device, in parallel, on many threads
  Kernel: code routine executed on the GPU
  Thread: instance of a kernel
Differences between GPU and CPU threads
  GPU threads are extremely lightweight
    Very little creation overhead
  The GPU needs 1000s of threads to achieve full parallelism
    Allows memory access latencies to be hidden
  Multi-core CPUs require fewer threads, but the available parallelism is lower
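A minimal sketch of these concepts (illustrative names and sizes, not from the slides): scale is the kernel, and each of the launched threads executes one instance of it.

#include <cuda_runtime.h>

// Kernel: one thread scales one array element
__global__ void scale(float *data, float k, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // this thread's element
    if (i < n)                                     // guard stray threads
        data[i] *= k;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));        // allocate device DRAM
    // (cudaMemcpy of the input data omitted for brevity)
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n); // launch ~n threads
    cudaDeviceSynchronize();                       // host waits for the GPU
    cudaFree(d_data);
    return 0;
}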
Thread Batching: Grids and Blocks

A kernel is executed as a grid of thread blocks (aka blocks)
  All threads within a block share a portion of data memory
A thread block is a batch of threads that can cooperate with each other by:
  Synchronizing their execution
    For hazard-free common memory accesses
  Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate
[Figure: the host launches Kernel 1 on Grid 1, a 3×2 arrangement of blocks (0,0)-(2,1), and Kernel 2 on Grid 2; Block (1,1) is expanded into a 5×3 arrangement of threads (0,0)-(4,2). Source: "NVIDIA CUDA Programming Guide", version 1.1]
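A sketch of such intra-block cooperation (hypothetical kernel, not from the slides): the threads of each block stage a 256-element tile in shared memory, synchronize, and then each reads an element written by a peer thread.

// Reverses each 256-element tile in place (assumes n % 256 == 0)
__global__ void reverseTiles(int *d) {
    __shared__ int s[256];              // low-latency per-block memory
    int i = blockIdx.x * 256 + threadIdx.x;
    s[threadIdx.x] = d[i];              // stage one element per thread
    __syncthreads();                    // barrier: hazard-free sharing
    d[i] = s[255 - threadIdx.x];        // read a peer thread's element
}
// Launch: reverseTiles<<<n / 256, 256>>>(d_data);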
Block and Thread IDs

Threads and blocks have IDs
  So each thread can identify what data it will operate on
  Block ID: 1D or 2D
  Thread ID: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
  Image processing
  Solving PDEs on volumes
  Other problems with an underlying 1D, 2D or 3D geometry
[Figure: the same grid/block/thread diagram as on the previous slide, here illustrating block and thread IDs. Source: "NVIDIA CUDA Programming Guide", version 1.1]
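For instance, in the image-processing case, a thread can derive the pixel it owns directly from its block and thread IDs (a generic sketch; names are illustrative):

__global__ void brighten(unsigned char *img, int w, int h, int delta) {
    int x = blockIdx.x * blockDim.x + threadIdx.x; // column, from 2D IDs
    int y = blockIdx.y * blockDim.y + threadIdx.y; // row
    if (x < w && y < h) {                          // stay inside the image
        int v = img[y * w + x] + delta;
        img[y * w + x] = v > 255 ? 255 : (unsigned char)v; // saturate
    }
}
// Launch with 2D blocks and a 2D grid:
// dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
// brighten<<<grid, block>>>(d_img, w, h, 32);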
Device Memory Space Overview

Each thread has:
  R/W per-thread registers
  R/W per-thread local memory
  R/W per-block shared memory
  R/W per-grid global memory
  Read-only per-grid constant memory
  Read-only per-grid texture memory
The host can R/W the global, constant and texture memories
[Figure: CUDA memory hierarchy: each thread has its own registers and local memory; each block has a shared memory; all threads in the grid access the global, constant, and texture memories, which the host can also read/write. Source: "NVIDIA CUDA Programming Guide", version 1.1]
Device Memory Space Usage

Register usage per thread should be minimized (max. 8192 registers/MP)
Shared memory is organized in banks
  Avoid bank conflicts (see the sketch below)
Global memory
  Main means of communicating R/W data between host and device
  Contents visible to all threads
  Coalescing recommended
Texture and constant memories
  Cached
  Initialized by the host
  Contents visible to all threads
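To illustrate the bank-conflict advice with a standard idiom (a generic example, not specific to this work): padding a shared-memory tile by one column makes column-wise accesses fall in distinct banks.

__global__ void transpose16(float *out, const float *in, int w, int h) {
    __shared__ float tile[16][16 + 1]; // +1 pad: column reads avoid bank conflicts
    int x = blockIdx.x * 16 + threadIdx.x;
    int y = blockIdx.y * 16 + threadIdx.y;
    if (x < w && y < h)
        tile[threadIdx.y][threadIdx.x] = in[y * w + x];  // coalesced load
    __syncthreads();
    x = blockIdx.y * 16 + threadIdx.x;  // swapped block coordinates
    y = blockIdx.x * 16 + threadIdx.y;
    if (x < h && y < w)
        out[y * h + x] = tile[threadIdx.x][threadIdx.y]; // conflict-free read
}
// Launch with dim3 block(16, 16) and dim3 grid((w + 15) / 16, (h + 15) / 16)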
Outline
Introduction
Technical Specifications of the GPU
CUDA Programming Model
Approach
Experimental Setup and Results
Conclusions
Approach

We implement a lookup table (LUT) based FS
  All gates' LUTs are stored in texture memory (cached)
  The LUTs of all library gates fit in the texture cache
    To avoid cache misses during lookup
  An individual k-input gate LUT requires 2^k entries
  Each gate's LUT entries are located at a fixed offset in texture memory, as shown below
  A gate's output is obtained by accessing the memory at "gate offset + input value"
    Example: output of an AND2 gate when its inputs are '0' and '1'
[Figure: gate LUTs at fixed offsets in texture memory; the AND2 LUT occupies entries 0-3 at its offset, and input value '01' selects entry 1, which holds output 0]
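Worked through for this example (assuming, for illustration, that AND2's LUT starts at offset 0):

int offset = 0;                     // AND2's fixed offset (assumed 0 here)
int a = 0, b = 1;                   // the two input values
int idx = offset + ((a << 1) | b);  // "gate offset + input value" = 0 + 01b = 1
// The AND2 LUT holds {0, 0, 0, 1}, so entry 1 gives output 0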
Approach

In practice, we evaluate two vectors for the same gate in a single thread
  1/2/3/4-input gates then require 4/16/64/256 LUT entries, respectively
Our library consists of an INV and 2/3/4-input AND, NAND, NOR and OR gates
  Hence the total memory required for all LUTs is 1348 words
  This fits in the texture memory cache (8 KB per MP)
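The 1348-word total follows directly from these LUT sizes:

  1 INV × 4 + 4 two-input types × 16 + 4 three-input types × 64 + 4 four-input types × 256
  = 4 + 64 + 256 + 1024 = 1348 words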
We exploit both fault and pattern parallelism
Approach – Fault Parallelism
All gates at a fixed topological level are evaluated in parallel.
[Figure: a levelized circuit between primary inputs and primary outputs; gates at logic levels 1, 2, 3, …, L are evaluated fault-parallel, one level at a time]
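A host-side sketch of this level-by-level scheduling (evalLevel, gatesAtLevel and nPatternGroups are illustrative names, not from the slides):

// One kernel launch per topological level: all gates at a level are
// independent, so their evaluations can proceed in parallel.
for (int lvl = 1; lvl <= L; lvl++) {
    int nThreads = gatesAtLevel[lvl] * nPatternGroups; // fault x pattern work
    evalLevel<<<(nThreads + 255) / 256, 256>>>(d_threadData, lvl);
    // the next level's threads read this level's outputs from global memory
}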
Approach – Pattern Parallelism

Simulations for any gate, for different patterns, are done in parallel, in 2 phases
  Phase 1: good circuit simulation. Results are returned to the CPU
  Phase 2: faulty circuit simulation, for all faults that lie in the gate's TFI (transitive fanin)
    The CPU does not schedule a stuck-at-v fault for a pattern whose good circuit value at the fault site is v (such a fault is not excited)
Fault injection is also performed in parallel
[Figure: pattern-parallel simulation of vectors 1 … N; for each vector, both a good circuit value and a faulty circuit value are computed]
Approach – Logic Simulation
typedef struct __align__(16) {
  int offset;     // Gate type's offset
  int a, b, c, d; // Input values
  int m0, m1;     // Mask variables
} threadData;
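A sketch of the per-thread evaluation this structure supports (our reconstruction: the kernel, the texture binding, and the 2-bit input packing are assumptions consistent with the slides; the legacy texture-reference API matches the CUDA 1.1 era):

texture<int, 1, cudaReadModeElementType> lutTex; // all gate LUTs (cached)

__global__ void simulateGates(const threadData *td, int *gateOut, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    threadData g = td[t];   // this thread's gate and inputs
    // For a 2-input gate, a and b each carry 2 packed pattern bits, so
    // (a << 2) | b indexes the gate's 16-entry LUT; wider gates use more
    // bits and larger LUTs (64/256 entries). The m0/m1 fault-injection
    // masks are applied afterwards (see Fault Injection below).
    gateOut[t] = tex1Dfetch(lutTex, g.offset + ((g.a << 2) | g.b));
}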
Approach – Fault Injection
typedef struct __align__(16) {
  int offset;     // Gate type's offset
  int a, b, c, d; // Input values
  int m0, m1;     // Mask variables
} threadData;
m0  m1  Meaning
-   11  Stuck-at-1 mask
11  00  No fault injection
00  00  Stuck-at-0 mask

('-' denotes a don't care; the two bits cover the two packed patterns)
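One masking rule consistent with this table (our reconstruction; the exact expression is not shown on the slides) is a bitwise AND followed by OR, applied to both packed patterns at once:

// val holds the gate's true output for the 2 packed patterns
__device__ int injectFault(int val, int m0, int m1) {
    // (val & m0) | m1 reproduces the table:
    //   m0 = 11, m1 = 00 : val passes through (no fault injection)
    //   m0 = 00, m1 = 00 : output forced to 00 (stuck-at-0)
    //   m0 = --, m1 = 11 : output forced to 11 (stuck-at-1; m0 is a don't care)
    return (val & m0) | m1;
}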
Approach – Fault Detection

typedef struct __align__(16) {
  int offset;                // Gate type's offset
  int a, b, c, d;            // Input values
  int Good_Circuit_threadID; // Good circuit simulation thread ID
} threadData_Detect;
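A sketch of the detection step this structure supports (hypothetical kernel; goodVal and Detect are illustrative names): each last-level thread evaluates its faulty output and compares it against the good circuit value produced by the thread named in Good_Circuit_threadID.

texture<int, 1, cudaReadModeElementType> lutTex; // gate LUTs, as before

__global__ void detectFaults(const threadData_Detect *td, const int *goodVal,
                             int *Detect, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t >= n) return;
    threadData_Detect g = td[t];
    int faulty = tex1Dfetch(lutTex, g.offset + ((g.a << 2) | g.b)); // 2-input case
    // XOR flags each packed pattern on which the faulty and good primary
    // output values differ, i.e. the fault is detected for that pattern.
    Detect[t] = faulty ^ goodVal[g.Good_Circuit_threadID];
}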
Approach – Recap

The CPU schedules the good and faulty gate evaluations.
Different threads perform, in parallel (for 2 vectors of a gate):
  Gate evaluation (logic simulation) for good or faulty vectors
  Fault injection
  Fault detection, for gates at the last topological level only
We maximize GPU performance by:
  Ensuring no data dependency exists between threads issued in parallel
  Ensuring that the same instructions are executed by all threads, but on different data
    Conforms to the SIMD architecture of GPUs
Maximizing Performance

We adapt to specific G80 memory constraints
The LUT is stored in texture memory. Key advantages are:
  Texture memory is cached
  The total LUT size easily fits into the available cache size of 8 KB/MP
  No memory coalescing requirements
  Efficient built-in texture-fetching routines are available in CUDA
  Non-zero time is taken to load the texture memory, but this cost is easily amortized
Global memory writes for level i gates (and reads for level i+1 gates) are performed in a coalesced fashion, as sketched below
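Sketched at the level boundary (names illustrative): because consecutive threads touch consecutive words, a half-warp's accesses coalesce into one wide G80 memory transaction.

__global__ void copyLevelOutputs(const int *levelOut, int *netValues,
                                 int base, int n) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    if (t < n)
        netValues[base + t] = levelOut[t]; // thread t touches word t:
                                           // coalesced read and write
}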
Outline
Introduction
Technical Specifications of the GPU
CUDA Programming Model
Approach
Experimental Setup and Results
Conclusions
Experimental Setup

FS runtimes on the 8800 GTX are compared to a commercial fault simulator for 30 IWLS and ITC benchmarks.
32 K patterns were simulated for all 30 circuits.
CPU times were obtained on a 1.5 GHz UltraSPARC-IV+ processor with 1.5 GB RAM, running Solaris 9.
Our time includes
  Data transfer time between the GPU and CPU (both directions)
    CPU → GPU: 32 K patterns, LUT data
    GPU → CPU: 32 K good circuit evaluations for all gates, array Detect
  Processing time on the GPU
  Time spent by the CPU to issue good/faulty gate evaluation calls
Results

Circuit          #Gates   #Faults   OURS (s)   COMM. (s)   Speedup
s9234_1            1261      2202      2.043      26.740    13.089
s35932            10537     24256      7.883     265.590    33.691
s5378              1682      3543      1.961      31.950    16.290
s13207             1594      3032      0.656      52.590    80.160
  ⋮
b22               34060     55077     58.330     252.040     4.167
b17_1             51340    120639     14.840     736.670    41.232
b10                 407       767      0.340       4.020    11.834
b02                  52      3114      0.028       1.280    45.911
Avg (30 Ckts.)                                              34.879
Computation results have been verified. On average, over the 30 benchmarks, a ~35X speedup is obtained.
Results (1U Tesla Server)

Circuit          #Gates   #Faults   PROJ. (s)   COMM. (s)   Speedup
s9234_1            1261      2202       0.282      26.740    94.953
s35932            10537     24256       0.802     265.590   567.941
s5378              1682      3543       0.271      31.950   117.716
s13207             1594      3032       0.091      52.590   579.453
  ⋮
b22               34060     55077      57.969     252.040     4.348
b17_1             51340    120639      14.335     736.670    51.391
b10                 407       767       0.051       4.020    78.494
b02                  52      3114       0.003       1.280   367.288
Avg (30 Ckts.)                                             238.185
The NVIDIA Tesla 1U server can house up to 8 GPUs
  Runtimes are obtained by scaling the GPU processing times only
  Transfer times and CPU processing times are included, without scaling
On average, a ~238X speedup is obtained.
Outline
Introduction
Technical Specifications of the GPU
CUDA Programming Model
Approach
Experimental Setup and Results
Conclusions
Conclusions

We have accelerated FS using GPUs
  We implement a pattern- and fault-parallel technique
  By careful engineering, we maximally harness the GPU's
    Raw computational power and
    Huge memory bandwidth
When using a single 8800 GTX GPU
  ~35X speedup compared to a commercial FS engine
When projected for a 1U NVIDIA Tesla server
  ~238X speedup is possible over the commercial engine
Future work includes exploring parallel fault simulation on the GPU
Thank You