

TinySPICE: A Parallel SPICE Simulator on GPU for Massively Repeated Small Circuit Simulations

Lengfei Han
Department of ECE
Michigan Tech. University
Houghton, MI 49931
[email protected]

Xueqian Zhao
Department of ECE
Michigan Tech. University
Houghton, MI 49931
[email protected]

Zhuo Feng
Department of ECE
Michigan Tech. University
Houghton, MI 49931
[email protected]

ABSTRACT
In today's variation-aware IC designs, cell characterizations and SRAM memory yield analysis require many thousands or even millions of repeated SPICE simulations for relatively small nonlinear circuits. In this work, we present a massively parallel SPICE simulator on GPU, TinySPICE, for efficiently analyzing small nonlinear circuits, such as standard cell designs, SRAMs, etc. To achieve high accuracy and efficiency, we present GPU-based parametric three-dimensional (3D) LUTs for fast device evaluations. A series of GPU-friendly data structures and algorithm flows have been proposed in TinySPICE to fully utilize the GPU hardware resources and minimize data communications between the GPU and CPU. Our GPU implementation allows for a large number of small circuit simulations in GPU's shared memory, involves novel circuit linearization and matrix solution techniques, and eliminates most of the GPU device memory accesses during the Newton-Raphson (NR) iterations, which enables extremely high-throughput SPICE simulations on GPU. Compared with the CPU-based TinySPICE simulator, GPU-based TinySPICE achieves up to 138X speedups for parametric SRAM yield analysis without loss of accuracy.

Categories and Subject Descriptors
B.7.2 [Design Aids]: Simulation—Integrated Circuits

General Terms
Performance, Algorithms, Verification

Keywords
Variation-aware analysis, SPICE simulation, GPU computing

1. INTRODUCTION
Reliability and yield analysis of embedded SRAM memory modules are critical to the designs of modern microprocessors, 3D-ICs, and mixed-signal SOCs. However, nanoscale SRAM designs are greatly challenged by prohibitively high computational cost due to the extremely large number of repeated SPICE simulations considering parametric variations [1–4]. Additionally, present-day variation-aware design methodologies require extremely fast cell/driver characterization capability capturing important process, voltage supply, and temperature (PVT) variations [5, 6], which also demands

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
DAC'13, May 29 - June 07, 2013, Austin, TX, USA.
Copyright 2013 ACM 978-1-4503-2071-9/13/05 ...$15.00.

much more powerful simulation methodologies. For instance, SRAM readability, writability and stability analysis considering threshold voltage (Vth), effective channel length (Leff) and power supply variations requires tens of millions of repeated SPICE simulations for a given design, while variation-aware cell modeling and characterization also involves constructing look-up tables (LUTs) capturing all fast/slow corners, which requires running many thousands of SPICE simulations [5–7].

Recent multiprocessors with heterogeneous architectures have emerged as mainstream computing platforms, which typically integrate a variety of processing elements with different computing performance, programming flexibility and energy efficiency characteristics. Heterogeneous computing platforms, such as IBM/Sony Cell architectures, today's personal computers (PCs) with multi-core CPUs and many-core GPUs, and the latest mobile heterogeneous microprocessors (e.g. APU from AMD [8], Tegra from Nvidia [9], etc.), can theoretically provide unprecedented performance and energy efficiency at the same time. With such heterogeneous computing architectures, VLSI CAD developers will have good opportunities to fully utilize these unique computing resources, thereby targeting much greater performance and energy efficiency.

Although there have been works that target accelerating SPICE simulations by performing device evaluations on GPU's hundreds of streaming processors and sparse matrix solves on CPU [10], only a small fraction of the computations can be accelerated on GPU, while the overall simulation performance is still limited by the relatively low communication bandwidth and large latency between the CPU and GPU. As a result, only 2X speedups have been obtained when compared with the CPU-based SPICE simulator [10]. Although a GPU-based LU algorithm has recently been proposed to efficiently factorize sparse circuit matrices [11], it is still not clear how to accelerate the entire computation involved in general-purpose SPICE simulations on GPU considering present-day GPU computing limitations.

In this work, we present a massively parallel SPICE simulator on GPU, TinySPICE, which accelerates the entire SPICE simulation on GPU without introducing excessive CPU-GPU data communications and device memory accesses. TinySPICE can analyze small nonlinear circuits in GPU's shared memory, as shown in Fig. 1, and thus gains unprecedentedly high computational throughput. We develop novel GPU-friendly data structures and an efficient algorithm flow for every kernel function of the SPICE algorithm, including device evaluations, matrix construction, linear system solving and Newton-Raphson (NR) iterations. A series of novel techniques have also been proposed in TinySPICE to more efficiently utilize GPU hardware resources, such as on-chip shared memory and registers, and to optimize GPU memory accesses. TinySPICE is capable of solving thousands of small circuit simulation problems in GPU's shared memory concurrently, and achieves unprecedented high-performance massively parallel SPICE simulations on GPU. Compared with the CPU-based TinySPICE

[Figure image omitted: six-transistor SRAM cell schematics (NMOS N1, N2, N3, N4; PMOS P1, P2; nodes A, A_b, word, bit, bit_b), a linearized MOSFET companion model and its stamped 8x8 MNA matrix pattern, and many such small-circuit simulations mapped onto the shared memory of GPU streaming multiprocessors (SPs); each simulation runs the loop: CKT schematic, model evaluation, matrix stamping/solve, Newton-Raphson iterations, return once converged.]
Figure 1: TinySPICE: massively parallel SPICE simulation program on GPUs.

[Figure image omitted: three-stage flow chart. CPU Setup: parse netlist; set up 3D LUTs for nonlinear devices; build excitation and MOS terminal map indices; set up the matrix for the linear circuit. GPU Setup: GPU LUTs, map indices, and G matrix & RHS placed in texture memory, then loaded into shared memory. GPU Analysis (per NR iteration): reset G matrix & update RHS, transistor evaluation, nonlinear stamp, get latest solution, LU solve, convergence check, return.]
Figure 2: The algorithm flow of TinySPICE.

implementation, TinySPICE achieves up to 138X speedups for a variety of circuit analysis problems without loss of accuracy.

2. OVERVIEW OF TINYSPICE
TinySPICE is a SPICE-accurate nonlinear circuit simulator that leverages many-core GPUs for fast repeated small circuit analysis (see the attached supplementary materials for more details on nonlinear circuit simulation algorithms and GPU computing). To leverage the power of the GPU's hundreds of streaming processors (SPs) for data-parallel computing, we propose a novel massively parallel algorithm with GPU-friendly data structures to accelerate repeated small nonlinear circuit simulations. Since memory consumption for analyzing such small circuits is usually very low, dense modified nodal analysis (MNA) matrix structures [12] and direct solution methods, such as the LU factorization technique, have been applied in this work without sacrificing efficiency or accuracy.

It is also important to note that by running each circuit simulation with one GPU thread, many thousands of independent SPICE simulations can be executed on GPU concurrently. Moreover, since only a small memory space is required for each circuit simulation, the data required by each simulation can be fully stored in GPU's shared memory, as illustrated in Fig. 1. Since data accesses to GPU's device memory take much longer than loads from GPU's shared memory, shared memory has been carefully utilized in this work to achieve much faster simulations and extremely high computing throughput. As a result, our TinySPICE engine runs much faster on GPU than conventional SPICE simulators running on CPUs, especially for massively repeated Monte Carlo small circuit simulations.

3. TINYSPICE ON GPU
In this section, the proposed TinySPICE simulator for massively parallel small circuit simulations is described in detail. The TinySPICE algorithm flow includes three key steps: the CPU-setup phase, the GPU-setup phase and the GPU-analysis phase, as illustrated in Fig. 2, where G denotes the system MNA matrix and RHS stands for the right-hand-side vector.

3.1 CPU-Setup Phase
The main task of the CPU-setup phase is to set up GPU-friendly data structures for SPICE simulations on GPU, such as the look-up tables (LUTs) for nonlinear devices, the index-mapping vectors, the linear-element matrices, the right-hand-side (RHS) vectors and the solution vectors.

3.1.1 Parametric 3D LUTs for Device Evaluations
For small circuits, due to the very limited number of nodes, device evaluations typically dominate SPICE simulation cost. To more effectively parallelize device evaluations on GPU during circuit simulations, 3D LUTs for evaluating transistors have been adopted in this work (see the supplementary sections for more details). Compared with standard BSIM4 model evaluations, which involve much more complicated computations, device evaluations using 3D LUTs require far less computational time. Such 3D LUTs can be efficiently built on CPU and then transferred to GPU before the SPICE simulations start. To capture the impact of process variations, such as effective channel length Leff and threshold voltage Vth variations, parameterized 3D LUTs are constructed to facilitate fast variation-aware SPICE simulations, as described in the supplementary materials.
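As a concrete illustration of LUT-based device evaluation, the following is a minimal CPU sketch (our own, not the authors' GPU kernel) of trilinear interpolation over a uniformly gridded 3D table indexed by (Vgs, Vds, Vbs); the dictionary-based table storage and uniform grid are simplifying assumptions.

```python
def trilinear(lut, axes, vgs, vds, vbs):
    """Interpolate lut[(i, j, k)] at the query point (vgs, vds, vbs).

    axes = (vgs_grid, vds_grid, vbs_grid), each a sorted list of
    uniformly spaced grid points; lut maps (i, j, k) -> a device quantity
    such as Ids.
    """
    def locate(grid, v):
        # Find cell index i and fractional offset t in [grid[i], grid[i+1]].
        step = grid[1] - grid[0]
        i = min(int((v - grid[0]) / step), len(grid) - 2)
        t = (v - grid[i]) / step
        return i, t

    (i, tx), (j, ty), (k, tz) = (locate(g, v) for g, v in
                                 zip(axes, (vgs, vds, vbs)))
    # Blend the 8 surrounding corner values of the enclosing grid cell.
    val = 0.0
    for di, wi in ((0, 1 - tx), (1, tx)):
        for dj, wj in ((0, 1 - ty), (1, ty)):
            for dk, wk in ((0, 1 - tz), (1, tz)):
                val += wi * wj * wk * lut[(i + di, j + dj, k + dk)]
    return val
```

Since trilinear interpolation reproduces any function that is linear in all three coordinates exactly, a linear test table gives a quick sanity check of such an implementation.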

3.1.2 Storage of Circuit Elements
It should be noted that, in order to obtain the stamping locations of nonlinear elements in the system MNA matrix, it is necessary to


store the terminal indices of each nonlinear device. In this work, we propose to store the terminal indices of all transistors in a long index-mapping vector, as shown in Fig. 3. In the Mos_map vector, Idx stands for the corresponding LUT storage index of a transistor; d, g, s, and b represent the indices of each transistor's terminals in the MNA matrix. With such mapping information, device evaluation results from the LUTs can be written directly into the system MNA matrix as well as the RHS vector.

Since the elements of linear devices such as resistors, capacitors and inductors have fixed values, their corresponding stamped elements in the MNA matrix will not change throughout the entire SPICE simulation. Consequently, they can be pre-evaluated and stored in a linear-element matrix. During each NR iteration, once the linear-element matrix is combined with the nonlinear-element matrix obtained from the nonlinear device evaluations, direct solution methods, such as LU factorization, can be applied to compute the solutions.
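For illustration, a minimal dense-MNA stamping sketch (our own simplification, not the paper's code) of the scheme just described: the fixed linear-element matrix is built once, and each NR iteration combines it with a freshly stamped nonlinear part.

```python
# stamp_g adds a conductance g between nodes a and b (node 0 = ground);
# the MNA row/column of node k is k - 1 in this sketch.

def stamp_g(G, a, b, g):
    if a:
        G[a - 1][a - 1] += g
    if b:
        G[b - 1][b - 1] += g
    if a and b:
        G[a - 1][b - 1] -= g
        G[b - 1][a - 1] -= g

def combine(linear_G, nonlinear_G):
    # Per NR iteration: final MNA matrix = pre-built linear part +
    # freshly stamped (linearized-device) nonlinear part.
    n = len(linear_G)
    return [[linear_G[i][j] + nonlinear_G[i][j] for j in range(n)]
            for i in range(n)]
```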

Since GPU's data-parallel computing scheme is not suitable for processing sparse matrices, our TinySPICE simulator adopts a dense matrix structure in this work. We emphasize that for small circuit simulations, the memory consumption and computational cost of storing and processing the dense MNA matrices are still acceptable.

3.1.3 Summary of CPU-Setup Phase
We summarize the CPU-setup phase of TinySPICE as follows:

• TinySPICE first builds parametric 3D LUTs for all transistors according to a user-defined accuracy level. A suitable discretization step size can be selected based on the circuit design information and specific simulation requirements: more accurate LUTs typically require greater memory space and transistor characterization time. The parametric 3D LUTs are stored in a long 1D vector (as shown in Fig. 3) that is transferred to GPU's device memory once, before the massively parallel SPICE simulations start.

• TinySPICE creates the 1D terminal index-mapping vector Mos_map to store the node indices of all nonlinear devices, as shown in Fig. 3. Mos_map is used to stamp nonlinear devices into the system MNA matrices, and needs to be constructed and transferred to GPU only once.

• The linear-element matrices are also stored in a 1D vector and sent to GPU memory. Once the GPU kernel functions are launched, the linear-element matrices are loaded into GPU's shared memory at the initial step and combined with the nonlinear-element matrices to form the final MNA matrices for the subsequent NR iterations.

• TinySPICE creates the VS_map vectors, which include information on all excitation sources such as voltage and current sources. Node a and node b denote regular terminal nodes. In modified nodal analysis (MNA), each voltage source requires including a pseudo node in the index-mapping vector to represent the current flowing through the device, as shown in Fig. 3.

• Similarly, TinySPICE creates VS_step vectors, which include the values of all time-varying voltage and current sources at each time step of transient simulations. VS_step is then combined with the constant excitation vector to form the final RHS vectors.
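The flattened storage above can be sketched as follows; the field order [Idx, Type, d, g, s, b] follows Fig. 3, but the concrete record layout and integer type codes are our own assumptions for illustration.

```python
FIELDS = 6  # per-transistor record: Idx, Type, d, g, s, b

def build_mos_map(transistors):
    # transistors: list of (lut_idx, type 0=NMOS/1=PMOS, d, g, s, b) tuples;
    # flatten them into one long 1D Mos_map vector.
    mos_map = []
    for lut_idx, mos_type, d, g, s, b in transistors:
        mos_map.extend([lut_idx, mos_type, d, g, s, b])
    return mos_map

def terminals(mos_map, m):
    # Return the (d, g, s, b) MNA indices of the m-th transistor.
    base = m * FIELDS
    return tuple(mos_map[base + 2: base + 6])
```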

3.2 GPU-Setup and GPU-Analysis Phases
The main task of the GPU-setup phase is to prepare a proper simulation environment for the subsequent circuit analysis on GPU, which includes device memory allocations and data transmission

[Figure image omitted: the flattened 1D storage vectors. 3D LUT vector: Gmat, Gmat_vth, Gmat_Leff blocks for MOSFET 1 ... n. Mos_map: records (Idx, Type, d, g, s, b) for MOSFET 1 ... n. VS_map: records (PWL, node a, node b, pseudo node P) per voltage source. VS_step: values V_t1, V_t2, ..., V_tn for voltage sources 1 ... n.]
Figure 3: Vectors for storing LUTs, MOSFETs, and excitation sources on GPU.

from host (CPU) to device (GPU). CUDA devices have several types of memories that exhibit different data access latencies and bandwidths, which may greatly influence GPU kernel execution performance (see the supplementary materials for more details).

3.2.1 GPU Memory Usage
TinySPICE has been designed to carefully utilize GPU's on-chip memory resources as follows.

• Read-only GPU memory: The parametric 3D LUTs, linear-element matrices, PWL voltage source values and RHS vectors are read-only for all GPU threads. Therefore, it is preferable to store them in GPU's read-only texture memory, such that the memory access latency can be effectively reduced.

• Read-write GPU memory: The index-mapping vectors for nonlinear devices and voltage sources are shared among all the circuits, so they need to be stored in GPU's shared memory. If GPU's shared memory is not sufficient, such mapping vectors can also be stored in GPU's registers. Additionally, the nonlinear-element matrices and RHS vectors are stored in GPU's registers, since their values change frequently.

3.2.2 GPU Data Organization for Coalesced Device Memory Access
To obtain the best parallel GPU computing performance, coalesced device (global) memory accesses should be achieved for all GPU threads. Since the solution vectors are not frequently accessed during NR iterations, but only used for storing results and sending them back to the host, we store them in GPU's global memory. Since TinySPICE works on many repeated circuit simulation tasks with different input excitations and circuit parameters, each solution vector stored in GPU's global memory can be reused for many consecutive simulations in GPU's shared memory.

To enable coalesced device memory accesses, we organize the memory storage of the solution vectors such that, for all n circuits, the memory space of the n solution vectors is contiguous, as shown in Fig. 4, where Tk denotes the k-th GPU thread, and xi.m denotes the m-th element of the solution vector of circuit i. This GPU-friendly data storage allows for efficient


[Figure image omitted: the interleaved solution array X = [X1.0, X2.0, ..., Xn.0, ..., X1.n, X2.n, ..., Xn.n], with threads T1 ... Tn accessing consecutive elements.]
Figure 4: The solution vector data access pattern on GPU.

coalesced global memory accesses, which can significantly reducethe GPU device memory access overhead.
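The interleaved layout of Fig. 4 reduces to simple index arithmetic; the following is a CPU-side sketch of the addressing scheme (not CUDA code): element m of circuit i sits at flat offset m*n + i, so when all n threads read element m of their own circuit, they touch n consecutive addresses.

```python
def flat_index(m, i, n):
    # Offset of element m of circuit i in the interleaved solution array.
    return m * n + i

def warp_addresses(m, n):
    # Addresses touched when threads T1..Tn each read element m of their
    # own circuit's solution vector.
    return [flat_index(m, i, n) for i in range(n)]
```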

3.2.3 GPU Thread Organization for TinySPICE
Since each GPU streaming multiprocessor (SM) has very limited memory resources, the number of circuits to be analyzed at the same time should be carefully determined based on the circuit sizes and on-chip memory usage (e.g. an Nvidia GeForce GTX480 GPU has 15 SMs, each of which includes 48KB of shared memory and 32K registers). The limited memory can restrict the number of GPU threads (SPICE simulations) running on each SM. To achieve the best simulation performance, TinySPICE first finds the optimal thread block and grid sizes by evaluating simple memory-cost functions (which compute the maximum number of circuits that can be analyzed on one SM). Subsequently, the proper thread organization and assignment are determined, and the final simulation code can be compiled for a given circuit design. It is worth noting that different circuit analysis problems may result in different GPU thread settings, and therefore different speedups compared to CPU-based SPICE simulations.
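A hedged sketch of such a memory-cost function, using the per-SM limits quoted above (48KB shared memory, 32K registers on a GTX480); the per-circuit cost model and the 64-registers-per-thread figure below are our own illustrative assumptions, not the paper's.

```python
def max_circuits_per_sm(n_unknowns, shared_bytes=48 * 1024,
                        regs=32 * 1024, regs_per_thread=64,
                        bytes_per_val=4):
    # Simplified per-circuit footprint: one dense n x n MNA matrix plus
    # an RHS vector and a solution vector, in single precision.
    per_circuit = bytes_per_val * (n_unknowns * n_unknowns + 2 * n_unknowns)
    # Occupancy is bounded by both shared memory and the register file.
    return min(shared_bytes // per_circuit, regs // regs_per_thread)
```

Under this toy model, a 6T-SRAM cell with 12 unknowns (Table 1) needs 4*(144+24) = 672 bytes per circuit, so shared memory caps one SM at 73 concurrent simulations, while the larger 4:1 Mux fits far fewer.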

3.3 Algorithm Flow for TinySPICE
The algorithm flow of TinySPICE is summarized in Algorithm 1. At the beginning of each NR iteration, TinySPICE evaluates all nonlinear devices (linearizes the system) using LUT-based trilinear interpolations according to the latest solution results. After the device evaluations, the computed elements of the nonlinear devices are stamped into the nonlinear-element matrices based on the terminal indices stored in the index-mapping vectors. The RHS vectors also need to be updated based on the latest solution results. The final MNA matrices are created by combining the nonlinear-element matrices with the linear-element matrices that have been previously built and stored in GPU's texture memory. Subsequently, a GPU-based LU decomposition algorithm is applied to factorize the MNA matrices.
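A per-thread dense LU solve of the kind applied at this step can be sketched as follows; this is a plain CPU sketch without pivoting (small MNA systems are often diagonally dominant enough to tolerate that), our simplification rather than the paper's GPU kernel.

```python
def lu_solve(A, b):
    # Doolittle LU factorization (unit-lower L) on a copy of A, followed
    # by forward and backward substitution. No pivoting.
    n = len(A)
    A = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            A[i][k] /= A[k][k]
            for j in range(k + 1, n):
                A[i][j] -= A[i][k] * A[k][j]
    y = b[:]
    for i in range(n):                 # forward substitution: L y = b
        for j in range(i):
            y[i] -= A[i][j] * y[j]
    x = y[:]
    for i in reversed(range(n)):       # backward substitution: U x = y
        for j in range(i + 1, n):
            x[i] -= A[i][j] * x[j]
        x[i] /= A[i][i]
    return x
```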

Algorithm 1 Newton-Raphson (NR) Iteration Algorithm Flow on GPU

Allocate the system MNA matrix and RHS vector in registers for each GPU thread.
Load the linear-element matrix, RHS vectors and index-mapping vectors from GPU's texture memory into shared memory.
for i = 1 → n NR iterations do
  1. Reset the system MNA matrix and RHS vector by loading initial data from shared memory.
  2. Evaluate nonlinear devices.
  3. Stamp the system MNA matrix and compute the RHS vector.
  4. Factorize the system MNA matrix of each circuit and solve for the solution vector.
  5. Apply a damping factor to the solution if needed.
end for
if NR does not converge then
  Perform another n iterations of steps 1-5.
end if
Return the solution if NR converged; otherwise return an error flag.

It should be noted that, in order to reduce GPU thread divergence under GPU's single-instruction-multiple-thread (SIMT) scheme,

Table 1: Experimental setup of test cases. "NL_Num" denotes the number of nonlinear devices, "Node_Num" the number of nodes in the circuit, "Vs_Num" the number of independent voltage sources, and "Unk_Num" the number of unknowns of the nonlinear system.

Circuit         NL_Num  Node_Num  Vs_Num  Unk_Num
6T-SRAM              6         8       5       12
D-Latch              8         9       5       13
D-Flip-Flop         16        12       5       16
Inverter-Chain      32        20       3       22
4:1 Mux             24        27       9       35

the convergence condition is not checked at every NR step. Instead, we check convergence only after several NR iterations. Although this method incurs some overhead, it effectively reduces the divergence of GPU threads.
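The deferred convergence check can be sketched as below; this is a scalar CPU sketch with illustrative batch size and tolerance, where nr_step stands in for one full evaluate-stamp-solve NR iteration.

```python
def run_nr(nr_step, solution, batch=5, max_batches=20, tol=1e-6):
    # nr_step(x) returns the next NR iterate of solution vector x.
    # Convergence is tested only between fixed-size batches, so all
    # threads in a warp would execute the same inner-loop instructions.
    for _ in range(max_batches):
        for _ in range(batch):  # no divergence-inducing branch in here
            new = nr_step(solution)
            delta = max(abs(a - b) for a, b in zip(new, solution))
            solution = new
        if delta < tol:          # checked once per batch only
            return solution, True
    return solution, False       # report non-convergence
```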

4. EXPERIMENTAL RESULTS

4.1 Experimental Setup
In this work, several widely used digital circuits have been tested using TinySPICE on GPU. To demonstrate the benefit of our GPU-based TinySPICE simulator, traditional CPU-based SPICE simulation methods and TinySPICE on CPU are implemented and evaluated. Detailed characteristics of the test cases are summarized in Table 1. We set up both first-order and second-order parametric 3D LUTs in our experiments. These LUTs have been tested using different resolutions. Throughout the following experiments, we use a high LUT resolution to guarantee that the final solution of TinySPICE matches the SPICE solution. Under the high resolution, the first-order LUTs cost 27MB of memory in total for a single transistor, while using the second-order LUTs doubles the memory cost1. It should also be noted that the 3D LUT setup time is not included in the experimental results. The average time for generating the first-order LUTs for a single transistor is around 0.435s, while including the second-order LUTs doubles the setup time. Compared with the whole SPICE simulation runtime, the 3D LUT setup time is typically much smaller. Furthermore, the LUT generation process can be easily parallelized using multi-core CPUs to reduce the LUT setup time. Since the accuracy achieved with the first-order parametric LUTs is very satisfactory in our experiments, all the following experimental results are obtained using the first-order LUTs to reduce the memory and runtime cost. All experiments have been performed on 64-bit Ubuntu 8.04 with a 2.66GHz quad-core CPU, 6GB of DRAM memory, and one Nvidia GeForce GTX480 GPU with 1.5GB of device memory.

4.2 Experimental Results

4.2.1 Accuracy of Parametric 3D LUT
A static random access memory (SRAM) cell is simulated to show the accuracy of the parametric 3D LUTs. For each test, we sweep the input from 0 to VDD. At each sweep point, 1000 ΔVth and ΔLeff variation parameters are generated randomly and independently for each transistor following a normal distribution. For each normally distributed parameter, the standard deviation σ is set to 10% of the nominal value. 1000 circuit DC simulations are performed. A CPU-based SPICE simulator using the original BSIM4 model evaluations generates the reference results, which are

1 Our second-order LUTs for transistors neglect cross-term impacts to reduce the LUT characterization time and memory, as described in the supplementary materials.


Table 2: Runtime results of TinySPICE for 1.5M Monte Carlo DC simulations. "CPU LUT" denotes the runtime of LUT-based SPICE simulation on CPU, "CPU BSIM4" the runtime of SPICE simulation with BSIM4 models on CPU, and "GPU LUT" the runtime of the proposed TinySPICE on GPU. Speedups are calculated relative to "CPU BSIM4".

Circuit         CPU BSIM4(s)  CPU LUT(s)       GPU LUT(s)
6T-SRAM              768.153  403.200(1.9X)    2.902(264X)
D-Latch             1212.979  527.155(2.3X)    5.727(211X)
D-Flip-Flop         2027.827  982.579(2.1X)    10.677(189X)
Inverter Chain      4377.600  1981.440(2.2X)   41.863(104X)
4:1 Mux             3686.400  1812.480(2.0X)   81.366(45X)

Table 3: Runtime results of TinySPICE for 1.5M Monte CarloTR simulations. “CPU-LUT" denotes the runtime for LUT-based SPICE simulation on CPU, “CPU BSIM4" denotes theruntime for SPICE simulation with BSIM4 models on CPU,“GPU-LUT" denotes the runtime for proposed TinySPICEon GPU. Speedups are calculated by comparing to the “CPUBSIM4"

Circuit          CPU BSIM4 (s)   CPU LUT (s)        GPU LUT (s)
6T-SRAM          30720           7679.82 (4.0X)     163.59 (187X)
D-Latch          41472           10751.95 (3.9X)    186.42 (222X)
D-Flip-Flop      69120           18432.15 (3.8X)    341.3 (202X)
Inverter Chain   121344          41472 (2.9X)       755.76 (160X)
4:1 Mux          256512          33792.15 (7.6X)    2658.3 (96X)

compared with the TinySPICE simulators implemented for the CPU and GPU computing platforms.

Fig. 5 shows the I-V characteristics of an nMOS transistor. In the figure, asterisks represent the I-V characteristics obtained using BSIM4 model evaluations, and circles represent the results obtained using the parametric 3D LUTs. As observed, the results obtained from the parametric LUTs are very close to those generated using the BSIM4 models. In our experiment, several different Vgs values, such as 0.3 V, 0.7 V, and 1.0 V, are chosen to validate the model accuracy.

Fig. 6 demonstrates the DC simulation results (for an internal node voltage) of the parametric SRAM analysis. The solid line in red is the baseline. The results show that our TinySPICE simulator matches the original SPICE simulator well, and can capture the parametric variations accurately. The average relative error is measured as 0.29%. The second-order LUTs have also been tested for DC simulation, with the average relative error dropping to 0.289%.
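The variation sampling used in this accuracy study can be sketched as follows; a minimal Python illustration, where the nominal Vth/Leff values and the random seed are placeholders for illustration, not the paper's exact settings:

```python
import random

def sample_variations(nominal, sigma_ratio=0.10, n=1000, seed=42):
    """Draw n independent normal variations for a nominal device
    parameter, with standard deviation set to 10% of the nominal value
    (as in the SRAM accuracy experiment)."""
    rng = random.Random(seed)
    sigma = sigma_ratio * nominal
    return [rng.gauss(0.0, sigma) for _ in range(n)]

# Hypothetical nominal values, for illustration only.
dvth = sample_variations(nominal=0.4)     # threshold-voltage deltas (V)
dleff = sample_variations(nominal=45e-9)  # effective-channel-length deltas (m)
```

Each Monte Carlo run then perturbs every transistor independently with one (ΔVth, ΔLeff) pair and performs a single DC simulation per sample.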

4.2.2 Runtime Results

First, we show the DC and transient simulation runtime results of our TinySPICE tool by comparing them with the results obtained by the CPU-based simulators. The runtime results of all simulators are obtained by running 1,536,000 simulations of different circuits with different excitations and circuit design parameters.

As observed in Table 2 and Table 3, the CPU-based SPICE simulator using LUTs achieves up to 2X speedups for DC simulations and 7X speedups for transient simulations when compared with the traditional "CPU BSIM4" simulator. The reason is that device evaluation via parametric 3D LUT interpolation is much cheaper than evaluating the BSIM4 models. Moreover, compared to the CPU-based SPICE simulator using LUTs, TinySPICE on GPU achieves up to 138X speedups for DC simulations; it runs up to 264X faster than the traditional "CPU BSIM4" simulator. For transient simulations, TinySPICE on GPU runs up to 222X faster than the traditional SPICE simulator (as shown in Fig. 8).

Figure 5: The I-V characteristics obtained by parametric 3D LUT and BSIM4 model evaluations. Circles denote the LUT evaluation results.

Figure 6: Scatter plot of the DC simulation results for SRAM circuits obtained by TinySPICE and the original BSIM4 SPICE simulator.

Figure 7: Comparison of DC simulation runtime.

Figure 8: Comparison of transient simulation runtime.

Figure 9: Memory usage (shared memory + registers) vs. speedups.

It should be noted that, once the circuit problem size increases,

the memory consumption of each GPU thread will also increase. As a result, the total number of GPU threads decreases due to the limited GPU on-chip memory resources, such as registers and shared memory. For instance, the "4:1 Mux" test case has 35 unknowns, and the speedup obtained on GPU is only 22X for DC simulations, as illustrated in Fig. 7. This corresponds to a much lower simulation performance on GPU than the result obtained from the "6T-SRAM" circuit, which has only 12 unknowns.
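The effect described above can be made concrete with a back-of-the-envelope model; a sketch assuming each thread stores a dense n×n double-precision MNA matrix plus vectors, and a hypothetical 48 KB shared-memory budget per SM (both the budget and the storage layout are assumptions for illustration, not TinySPICE's actual implementation):

```python
def threads_per_sm(num_unknowns, shared_mem_bytes=48 * 1024):
    """Estimate how many simulation threads fit in one SM's shared
    memory if each thread keeps a dense n x n MNA matrix plus the
    right-hand-side and solution vectors, all in double precision."""
    n = num_unknowns
    bytes_per_thread = 8 * (n * n + 2 * n)  # matrix + RHS + solution
    return shared_mem_bytes // bytes_per_thread

# Quadratic growth in per-thread storage quickly starves the SM
# of concurrent threads as the circuit size grows.
for n in (12, 20, 35):  # e.g. 6T-SRAM, a mid-size cell, 4:1 Mux
    print(n, threads_per_sm(n))
```

Under these assumptions, moving from 12 to 35 unknowns cuts the number of resident threads by roughly an order of magnitude, which is consistent with the observed drop in speedup.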

In the following, the relationship between the runtime speedups and GPU on-chip memory consumption (shared memory and registers) is analyzed. As illustrated in Fig. 9, the blue curve denotes the memory usage for different circuits, and the red curve denotes the speedups of the GPU-based SPICE simulator using LUTs over the CPU-based SPICE simulator using LUTs. We observe that, as the number of unknowns of a circuit increases linearly, the memory consumption increases dramatically, due to the storage requirement of the dense MNA matrices. For each GPU thread, the dominant portion of on-chip GPU memory is consumed by storing the system MNA matrices. Since the registers available in each SM are very limited, once more on-chip memory is consumed by a single GPU thread, far fewer GPU threads can be assigned to a GPU's SM. As a result, the GPU computing resources may not be fully utilized, or there may not be enough active GPU threads, which in turn dramatically reduces the runtime speedups.

5. CONCLUSIONS

In this work, we present a massively parallel SPICE-accurate nonlinear circuit simulation engine, TinySPICE, for variation-aware embedded memory and standard cell analysis by leveraging emerging parallel GPU computing platforms. By accelerating the entire SPICE simulation flow in the GPU's on-chip memory, such as shared memory and registers, and employing parametric 3D LUTs, SRAM yield analysis and standard cell variation-aware characterizations can be performed much faster than ever before. Compared with standard CPU-based SPICE simulation engines, our extensive experimental results show that the TinySPICE simulation engine achieves up to 264X speedups for parametric SRAM simulations without sacrificing solution accuracy. Although TinySPICE is designed especially for small circuit simulations (with fewer than 20 unknowns), and its performance can be limited by the GPU's on-chip memory resources for larger circuit analysis problems, TinySPICE still achieves up to 22X speedups in simulating a circuit with 35 unknowns.

6. REFERENCES

[1] R. Kanj, R. V. Joshi, and S. R. Nassif. Mixture importance sampling and its application to the analysis of SRAM designs in the presence of rare failure events. In Proc. IEEE/ACM DAC, pages 69–72, 2006.

[2] A. Bansal, R. N. Singh, R. Kanj, S. Mukhopadhyay, J. Lee, E. Acar, A. Singhee, K. Kim, C. Chuang, S. R. Nassif, F. Heng, and K. K. Das. Yield estimation of SRAM circuits using "Virtual SRAM Fab". In Proc. IEEE/ACM ICCAD, pages 631–636, 2009.

[3] J. Wang, S. Yaldiz, X. Li, and L. T. Pileggi. SRAM parametric failure analysis. In Proc. IEEE/ACM DAC, pages 496–501, 2009.

[4] J. Wang, A. Singhee, R. A. Rutenbar, and B. H. Calhoun. Two fast methods for estimating the minimum standby supply voltage for large SRAMs. IEEE Trans. on Computer-Aided Design, 29(12):1908–1920, 2010.

[5] C. Amin, C. Kashyap, N. Menezes, K. Killpack, and E. Chiprout. A multi-port current source model for multiple-input switching effects in CMOS library cells. In Proc. IEEE/ACM DAC, pages 247–252, 2006.

[6] P. Li, Z. Feng, and E. Acar. Characterizing multistage nonlinear drivers and variability for accurate timing and noise analysis. IEEE Trans. on Very Large Scale Integration (VLSI) Systems, 15(11):1205–1214, 2007.

[7] N. Menezes, C. V. Kashyap, and C. S. Amin. A "true" electrical cell model for timing, noise, and power grid verification. In Proc. IEEE/ACM DAC, pages 462–467, 2008.

[8] AMD Corporation. AMD Fusion Family of APUs: Enabling a Superior, Immersive PC Experience. AMD whitepaper, [Online]. Available: http://sites.amd.com/us/fusion/apu/Pages/fusion.aspx, 2011.

[9] Nvidia Corporation. Bringing High-End Graphics to Handheld Devices. Nvidia whitepaper, 2011.

[10] K. Gulati, J. F. Croix, S. P. Khatri, and R. Shastry. Fast circuit simulation on graphics processing units. In Proc. IEEE/ACM ASPDAC, pages 403–408, 2009.

[11] L. Ren, X. Chen, Y. Wang, C. Zhang, and H. Yang. Sparse LU factorization for parallel circuit simulation on GPU. In Proc. IEEE/ACM DAC, pages 1125–1130, 2012.

[12] L. Pillage, R. Rohrer, and C. Visweswariah. Electronic Circuit & System Simulation Methods. McGraw-Hill, 1995.

[13] Nvidia CUDA programming guide. [Online]. Available: http://www.nvidia.com/object/cuda.html, 2007.

[14] Nvidia Corporation. Fermi compute architecture whitepaper. [Online]. Available: http://www.nvidia.com/object/fermi_architecture.html, 2010.


Supplementary Material

S.1 Nonlinear Circuit Simulation Approaches

General nonlinear electronic circuit simulation techniques rely on the NR method to solve the following nonlinear differential equations [12]:

    f(x(t)) + (d/dt) q(x(t)) + u(t) = 0,    (1)

where f(·) and q(·) denote the static and dynamic nonlinearities, x(t) is a vector including nodal voltages as well as branch currents, and u(t) is the input excitation vector. Sophisticated numerical methods can be used to solve the above nonlinear differential equations by first linearizing the nonlinear circuit system at a given solution point, and subsequently solving the corresponding linear matrix problems. For instance, after linearizing the system, the conductance matrix G(x^k) = ∂f/∂x |_{x^k} and the capacitance matrix C(x^k) = ∂q/∂x |_{x^k} can be easily obtained, which are typically asymmetric matrices. The dominant computational cost for solving small circuit problems is mainly due to the nonlinear device evaluations, while for much larger circuits, solving the asymmetric Jacobian matrices using direct solution methods can be much more expensive due to the rapidly increasing runtime and memory cost.
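The linearize-then-solve loop above can be illustrated on a one-node example; a minimal Python sketch for a resistor driving a diode to ground, where the device constants (Is, Vt, R, Vdd) are illustrative values, not taken from the paper:

```python
import math

def solve_diode_node(vdd=1.0, r=1e3, i_s=1e-12, v_t=0.02585,
                     tol=1e-12, max_iter=100):
    """Newton-Raphson DC solve of the KCL equation
    f(v) = Is*(exp(v/Vt) - 1) + (v - Vdd)/R = 0 at the node between a
    resistor (to Vdd) and a diode (to ground). Each iteration
    linearizes the diode into a conductance g = df/dv, which is the
    scalar analogue of the G matrix obtained from Eq. (1) after
    dropping the dynamic term d/dt q(x)."""
    v = 0.5  # initial guess near the expected diode drop
    for _ in range(max_iter):
        f = i_s * (math.exp(v / v_t) - 1.0) + (v - vdd) / r
        g = (i_s / v_t) * math.exp(v / v_t) + 1.0 / r  # Jacobian
        dv = -f / g
        v += dv
        if abs(dv) < tol:
            break
    return v

v_node = solve_diode_node()
```

For small circuits like this, each NR iteration is dominated by the device evaluation (here, `math.exp`) rather than the linear solve, which mirrors the cost breakdown stated above.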

S.2 Massively Parallel GPU Computing

S.2.1 Recent GPUs

The recent Nvidia GeForce GTX 285 GPU includes 30 streaming multiprocessors (SMs), and each SM has eight streaming processors (SPs) that share the same instruction unit as well as 32 KB of on-chip shared memory, as shown in Fig. 10. According to the CUDA programming model [13], 32 threads are formed into a warp and execute the same instruction every four clock cycles, resulting in very light overhead (one instruction issue is followed by 32 thread executions). When a kernel function is launched on GPU, the task (data) is further divided into many thread blocks (1D, 2D, or 3D) based on the problem size and available on-chip hardware resources. Each thread block may include multiple warps of threads. Subsequently, each SM works on a few thread (data) blocks with its eight SPs. The recent Fermi GPU from Nvidia has increased the number of SPs in each SM from 8 to 32, boosting the total number of streaming processors to 512 [14]. The new GPU model also supports high-performance double-precision computing and concurrent kernel execution: up to 16 kernels can be launched concurrently on the 16 SMs of a Fermi GPU, while in previous GPU architectures only one kernel could be launched at a time, which allows for more flexible and efficient GPU computing.

S.2.2 Key Issues in Efficient GPU Computing

GPU's on-chip memory (shared memory and registers) is very fast, but the available on-chip memory resources can be quite limited, whereas the off-chip device memory (global memory) is sufficiently large but much slower than the on-chip memories. Additionally, coalesced GPU global memory accesses are important, since random memory accesses are typically much slower: the device memory bandwidth can reach 100 GB/s when accessed in a coalesced pattern, but may drop by 10X when accessed in a random manner [13]. If random memory access is needed, texture memory on GPU (analogous to the L1 and L2 caches of a CPU) should be used, though a good memory access pattern is still desired such that the threads of a warp access neighboring memory locations.

GPU’s hardware and software properties impose the followingchallenges when developing streaming data parallel computing al-gorithms: (1) the dependencies among different tasks (data) should

SP SP

Instruction Fetch/DispatchInstruction L1 Data L1

Streaming Multiprocessor (SM30)

Instruction Fetch/DispatchInstruction L1 Data L1

Streaming Multiprocessor (SM2)

I i L1 D L1

Streaming Multiprocessor (SM1)GlobalM

emor

GTX 285 GPUH

SPSPSP

SP

SPSPSP

SP

SFU

Shared Memory

SFU

SPSPSP

SP

SPSPSP

SP

SFU

/ p

Shared Memory

SFU

SPSPSP

SP

SPSPSP

SP

SFU

Instruction Fetch/DispatchInstruction L1 Data L1

SFU

ry(DRAM

)

HostMem

ory177GB/SP SPShared Memory

/sBW

Figure 10: The GTX 285 GPU architecture.

be minimized, (2) excessive global data sharing and shared mem-ory (register) bank conflicts should be avoided, (3) the arithmeticintensity that is defined as the number of floating point operationsper data reading/writing should be maximized, and (4) the algo-rithm control flow should be simplified.

S.3 Parametric 3D Look-up Tables

In order to meet requirements in both accuracy and runtime efficiency, parametric 3D LUT models are constructed for evaluating transistors during circuit simulations. LUT-based evaluation of a smooth function derived from the truncated Taylor expansion can be formulated as follows:

    T_d(x) = f(c) + Σ_{k=1}^{d} [f^(k)(c) / k!] (x − c)^k,    (2)

where f(c) denotes the evaluation function and f^(k)(c) denotes the k-th order derivative at the reference point c, x is the evaluation point, and d is the degree of the Taylor polynomial. The approximated evaluation can be carried out by looking up a precalculated LUT for the coefficients associated with (x − c)^k. For the second-order Taylor polynomial expansion (d = 2), we obtain the second-order parametric 3D LUT evaluation function:

    LUT = LUT_base + LUT_Vth · ΔVth + LUT_Leff · ΔLeff
        + LUT_Vth2 · ΔVth² + LUT_Leff2 · ΔLeff²
        + LUT_VthLeff · ΔVth · ΔLeff,    (3)

where LUT_base represents the base LUT generated from the transistor's nominal parameters. LUT_Vth and LUT_Leff are the first-order coefficient LUTs for the transistor threshold voltage and effective channel length, respectively; LUT_Vth2 and LUT_Leff2 are the second-order coefficient LUTs; and LUT_VthLeff is the cross-term coefficient LUT derived from the mixed partial derivative with respect to Vth and Leff. To reduce the complexity, this cross-term is ignored in our implementation. ΔVth and ΔLeff denote the variations of the threshold voltage and effective channel length. The base LUT and the coefficient LUTs together compose the whole parametric 3D LUTs of a transistor. The number and order of the coefficient LUTs can be adjusted according to the number of input parameters and the accuracy requirement; however, it is not always necessary to introduce higher-order LUTs for each parameter. Since these parametric LUTs capture the variations of transistor parameters, we do not need to update the LUTs for every parametric SPICE simulation. In other words, only a one-time transfer
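Equation (3), with the cross-term dropped as in our implementation, reduces each per-sample device evaluation to a handful of multiply-adds once the coefficient tables have been interpolated; a minimal Python sketch, where the coefficient values are made-up numbers for illustration:

```python
def parametric_lut_eval(base, c_vth, c_leff, c_vth2, c_leff2,
                        d_vth, d_leff):
    """Second-order parametric LUT evaluation per Eq. (3), with the
    cross-term LUT_VthLeff omitted. Each coefficient argument is the
    value already interpolated from its 3D table at the bias point
    (Vds, Vgs, Vbs)."""
    return (base
            + c_vth * d_vth
            + c_leff * d_leff
            + c_vth2 * d_vth ** 2
            + c_leff2 * d_leff ** 2)

# Hypothetical interpolated coefficients for one bias point:
ids = parametric_lut_eval(base=1.0e-4, c_vth=-2.0e-4, c_leff=-1.5e3,
                          c_vth2=3.0e-4, c_leff2=2.0e6,
                          d_vth=0.02, d_leff=-2.0e-9)
```

Because ΔVth and ΔLeff enter only through these multiply-adds, the same coefficient tables serve every Monte Carlo sample, which is what allows the one-time CPU-to-GPU transfer described next.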


of the parametric LUTs from CPU to GPU is required, which can greatly reduce the overhead of CPU-GPU communications.

The proposed TinySPICE first parses a standard SPICE-like circuit netlist, and evaluates the BSIM4 transistor models to build the parametric 3D LUTs for all nonlinear transistors. When building the parametric 3D LUTs, we use ΔVth, ΔLeff, Vds, Vgs and Vbs as the input variables, where Vds, Vgs and Vbs denote the terminal voltages of the MOSFET devices. The coefficient LUTs LUT_Vth, LUT_Leff, LUT_Vth2 and LUT_Leff2 are calculated after generating LUT_base. The parametric 3D LUT outputs include all the elements required for stamping the conductance and capacitance matrices obtained from linearizing (1) during SPICE simulations, such as conductances, capacitances, currents and charges.

After extracting all the data required by these parametric 3D LUTs using thousands of BSIM4 model evaluations, we store all the data in a long vector to allow coalesced device memory accesses on GPU, as shown in Fig. 3. Considering the large amount of data (more than forty elements) computed in one transistor evaluation, we store the data such that good data locality is preserved, ensuring efficient texture memory accesses during the LUTs' trilinear interpolations using the eight neighboring points.

Since device evaluations using the 3D LUTs are based on eight-point trilinear interpolation, a device evaluated by LUTs requires much less computational time than the BSIM4 model evaluations, which involve very complex formulas. We observe that for most digital circuit modeling and analysis applications, the accuracy level obtained using the LUT-based SPICE simulator (with first-order parametric LUTs) is very satisfactory, though for analog circuits convergence may become more difficult.
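The eight-point trilinear interpolation underlying each LUT read can be sketched as follows; a minimal Python illustration over a regular grid stored as a flat row-major list (the grid layout and resolution here are illustrative, not TinySPICE's actual storage format):

```python
def trilinear(lut, dims, origin, step, v):
    """Interpolate a 3D table at point v = (vds, vgs, vbs).
    `lut` is a flat list in row-major order with shape `dims`;
    `origin` and `step` define a uniform grid along each axis."""
    nx, ny, nz = dims
    idx, frac = [], []
    for p, o, s, n in zip(v, origin, step, dims):
        t = (p - o) / s
        i = min(max(int(t), 0), n - 2)  # clamp to a valid grid cell
        idx.append(i)
        frac.append(t - i)

    def at(i, j, k):
        return lut[(i * ny + j) * nz + k]

    fx, fy, fz = frac
    i, j, k = idx
    val = 0.0
    # Blend the eight corner points of the enclosing cell.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((fx if di else 1 - fx) *
                     (fy if dj else 1 - fy) *
                     (fz if dk else 1 - fz))
                val += w * at(i + di, j + dj, k + dk)
    return val
```

For a function that is linear in all three variables, trilinear interpolation reproduces it exactly on the grid, which makes a convenient sanity check when validating a table against BSIM4 evaluations.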