A parallel direct method circuit simulator based on sparse matrix partitioning

K.Y. Wu, P.K.H. Ng, X.D. Jia *, R.M.M. Chen, A.M. Layfield

Department of Electronic Engineering, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong

Abstract

Solving a system of linear simultaneous equations representing an electrical circuit is one of the most time consuming tasks in large scale circuit simulation. In order to facilitate a multiprocessor implementation of the circuit simulation program SPICE, decomposition algorithms should be employed to partition the sparse matrix equation of the overall circuit into a number of subcircuit equations for parallel processing. In this paper, the performance of a parallel direct method matrix equation solving routine is studied in several contexts: the theoretical lower bound on performance is derived, the tradeoff between parallelism and communication is presented, and various implementation and performance tuning issues are reported. The routine is written in such a manner that its data structure is compatible with SPICE Version 3c1. The speedup obtained from the simulation of two test circuits on a message passing multiprocessor system built on transputers is reported. Finally, the factors affecting the performance of the multiprocessor system are outlined and the overheads affecting the system performance in the implementation are identified. © 1998 Elsevier Science Ltd. All rights reserved.

Keywords: Circuit simulator; Sparse matrix partitioning; Sparse matrix equation; Parallel direct method

1. Introduction

Computer-aided circuit simulation is one of the most important and time consuming computational tasks in VLSI circuit design. For VLSI circuit design, circuit simulation using standard circuit simulators like SPICE [1], ASTAP [2], and SLATE [3] may require many days on a computer system with one processor. Although some simulators based on relaxation

Computers & Electrical Engineering 24 (1998) 385–404

0045-7906/98/$19.00 © 1998 Elsevier Science Ltd. All rights reserved. PII: S0045-7906(98)00022-6

PERGAMON

* Corresponding author. Fax: +852-2788-7791; E-mail: [email protected].


methods [4, 5] can perform circuit simulation up to two orders of magnitude faster than SPICE, convergence cannot be guaranteed for all classes of circuits.

Solving the large sparse matrix equations (SMEs) representing the circuit equations is one of the most time consuming tasks in large scale circuit simulation. By employing parallel processing techniques, the runtime of circuit simulation can be reduced [6–8].

In order to facilitate a multiprocessor implementation of circuit simulation based on the direct method [1], circuit equation decomposition techniques should be used to partition the overall large sparse linear system into a number of subsystems. In our work, the matrix decomposition algorithm proposed by Chen [9] was employed to decompose the sparse matrix equation into several submatrix equations, whose computation can be arranged for parallel processing on a cost-effective multiprocessor system. The speedup achieved by the decomposition algorithm, the effect of communication overhead and the relative merits of different multiprocessor topologies were previously estimated by a multiprocessor simulator and reported in Ref. [10].

In this paper, various implementation issues of the parallel SME solving routines are reported. The routines were written in such a manner that the data structure is fully compatible with SPICE Version 3c1. The hardware platform used in this work is a message passing multiprocessor system based on transputers. The factors which affect the system performance are identified and some performance tuning aspects of the SME solving routines are presented in detail. The organization of the paper is as follows. In Section 2, the decomposition algorithm proposed by Chen is briefly introduced. In Section 3, the theoretical lower bound on performance is derived and the tradeoff between parallelism and communication is presented. In Sections 4 and 5, the implementation and performance tuning aspects of the algorithm on a multi-transputer platform are discussed. In Section 6, the speedups obtained for the simulation of two test circuits are reported. Finally, discussion and conclusions are presented in Section 7.

2. Decomposition algorithm

The decomposition algorithm used in this work has been discussed in Refs. [9–11]; a brief outline is included here. Let the overall circuit equation be written as

A x = y,   (1)

where matrix A is a sparse matrix with dimension M. After the circuit is partitioned into I subcircuits, the equations of the subcircuits can be rewritten in the following form:

y_i = B_i x_i + D_i z_i,   (2)

where B_i is the circuit matrix of subcircuit i with dimension N_i × N_i, D_i is an N_i × n_i circuit matrix, and z_i is an n_i-vector obtained from x by retaining the elements corresponding to the columns of D_i.


The ith subcircuit solution can be obtained from Eq. (2) as

x_i = (B_i)^{-1} (y_i − D_i z_i)   (3)

or

x_i = v_i + E_i z_i,   (4)

where

v_i = (B_i)^{-1} y_i   (5)

and

E_i = −(B_i)^{-1} D_i.   (6)

All items on the right-hand side of Eq. (4) are known except z_i. Let z be an m-vector whose elements consist of all the elements of z_i, for i = 1, 2, ..., I, without any duplication in terms of the elements of x. It was shown in [9] that

z = Hz + w   (7)

or

z = (I − H)^{-1} w,   (8)

where H and w can be computed from the elements of E_i and v_i, i = 1, 2, ..., I. Eq. (8) is called the interconnection level equation. Once z is calculated from Eq. (8), all subcircuit solutions can be computed independently from Eq. (4). Parallel computation can be arranged.
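As an illustrative sketch of Eqs. (1)–(8) (not the paper's implementation), the fragment below solves a small two-subcircuit system by the decomposition and checks the result against the assembled overall equation. The selection operators S_i (mapping z to z_i) and Q_i (collecting the interconnection entries of x_i into z) are our own hypothetical notation for bookkeeping the paper leaves implicit.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two subcircuits of size 3; two interconnection variables (hypothetical
# example): z[0] = x1[2] (owned by subcircuit 1), z[1] = x2[0] (subcircuit 2).
N1, N2, m = 3, 3, 2
B1 = rng.random((N1, N1)) + 3 * np.eye(N1)   # diagonally dominant, invertible
B2 = rng.random((N2, N2)) + 3 * np.eye(N2)
D1 = rng.random((N1, 1))                     # subcircuit 1 couples to z[1]
D2 = rng.random((N2, 1))                     # subcircuit 2 couples to z[0]

# Selection operators (our notation): z_i = S_i z,  z = Q1 x1 + Q2 x2.
S1 = np.array([[0., 1.]])
S2 = np.array([[1., 0.]])
Q1 = np.array([[0., 0., 1.], [0., 0., 0.]])
Q2 = np.array([[0., 0., 0.], [1., 0., 0.]])

# Assemble the overall equation Ax = y (Eq. 1) for reference.
A = np.block([[B1 + D1 @ S1 @ Q1, D1 @ S1 @ Q2],
              [D2 @ S2 @ Q1,      B2 + D2 @ S2 @ Q2]])
x_true = rng.random(N1 + N2)
y = A @ x_true
y1, y2 = y[:N1], y[N1:]

# Per-subcircuit work, Eqs. (5) and (6) -- this part is parallelisable.
v1 = np.linalg.solve(B1, y1); E1 = -np.linalg.solve(B1, D1)
v2 = np.linalg.solve(B2, y2); E2 = -np.linalg.solve(B2, D2)

# Interconnection-level equation, Eqs. (7) and (8): z = Hz + w.
H = Q1 @ E1 @ S1 + Q2 @ E2 @ S2
w = Q1 @ v1 + Q2 @ v2
z = np.linalg.solve(np.eye(m) - H, w)

# Back-substitution, Eq. (4), again independent per subcircuit.
x1 = v1 + E1 @ S1 @ z
x2 = v2 + E2 @ S2 @ z
assert np.allclose(np.concatenate([x1, x2]), x_true)
```

Only the per-subcircuit factorizations (the v_i and E_i computations) touch a single subcircuit, so they are the part that can be farmed out to separate processors; the small (I − H) solve remains a serial interconnection-level step.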

3. Lower bound on performance

In this section, the problem of optimally partitioning a computation into parallel tasks is studied. The model of parallel computation adopted is based on a directed acyclic graph. Several interconnection structures are studied, including the pipeline, the binary tree and the n-linear array (for simplicity, we refer to it as a `tree' in the following text). By taking into account the efficiency and speedup obtained, the lower bound on the parallel execution time is presented. The optimal point of partitioning that yields a minimum execution time for the pipeline and binary tree topologies is also derived.

The partitioning algorithm can be characterized by a weighted directed acyclic graph (DAG). A DAG is represented by a tuple G = (V, E, C, T), where V = {n_j, j = 1 : v} is the set of nodes and v = |V|; E = {e_{i,j} = (n_i, n_j)} is the set of communication edges and e = |E|; C is the set of communication costs and T is the set of node computation costs.

E is the set of directed edges, which define a partial order or precedence constraints on V. A directed edge (i, j) between two tasks i and j exists if there is a data dependency between them, which means that task j cannot start execution until its predecessor task has completed its operation. A weight c_ij is associated with each edge e_ij, where c_ij denotes the amount of


communication required from n_i to n_j. The value t_i ∈ T is the computation cost for node n_i ∈ V. An example of a weighted DAG with seven tasks is shown in Fig. 1.

Consider the case of a diamond-shaped DAG as shown in Fig. 2, representing the situation when a given computation is decomposed into n subproblems of equal size. Each of these subproblems can be executed as a separate, independent task on a separate processing element. By processing element, we refer to a combined communicating and execution unit, in which computation and communication operations can be performed concurrently.

To derive the lower bound on performance, the parallelism offered by the hardware is assumed to be fully utilized. As such, the minimum extent of decomposition for a given problem should be equal to the number of processing elements n. The pattern of dependency is in the form of a diamond-shaped DAG. Such task graphs have only one entry node and one exit node. The operation associated with the entry node is the distribution of the data to the processing elements. The operation associated with the exit node is the collection of the results of the decomposed computations from the processing elements.

No superlinear speedup is offered by the partitioning algorithm, and the problem is partitioned into subproblems of equal size where possible. The communication functions for different interconnection topologies represent the total number of hops required on the interconnection network when there are (n − 1) independent tasks to distribute. It was assumed

Fig. 1. An example of a weighted DAG with seven tasks.

Fig. 2. A diamond-shaped DAG representing a computation decomposed into n equal subproblems.


that the bandwidth of the interconnection network is always adequate to serve the requests generated by the processing elements. The lower bound on parallel execution time was established on the basis that the interconnection network can serve only one communication at any time. Certainly, the parallel computation time can be greatly reduced if some of the communication operations can be completed in parallel.
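To make the DAG model concrete, the sketch below (a toy example with hypothetical node and edge costs, not data from the paper) computes the critical-path length of a small diamond-shaped graph; with unlimited processors the critical path is a simple lower bound on parallel execution time, before the serial-communication assumption tightens it further.

```python
import functools

# Minimal weighted-DAG model G = (V, E, C, T): node costs t_i and
# edge (communication) costs c_ij; all values here are hypothetical.
t = {"entry": 1, "a": 5, "b": 5, "c": 5, "exit": 1}          # node costs T
c = {("entry", "a"): 2, ("entry", "b"): 2, ("entry", "c"): 2,
     ("a", "exit"): 2, ("b", "exit"): 2, ("c", "exit"): 2}   # edge costs C
succ = {"entry": ["a", "b", "c"], "a": ["exit"], "b": ["exit"],
        "c": ["exit"], "exit": []}

@functools.cache
def longest(v):
    """Critical-path length from node v: its own computation cost plus
    the most expensive (communication + downstream) continuation."""
    return t[v] + max((c[(v, u)] + longest(u) for u in succ[v]), default=0)

print(longest("entry"))   # 1 + 2 + 5 + 2 + 1 = 11
```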

Let T denote the time needed to execute a given problem on a one-processor system. If the problem is partitioned evenly across n processors, the execution time of each subproblem will be T/n. If the total overhead in time required for distributing the tasks is C, then the execution time on a system with n processors is:

Execution time = (T + C f(n)) / n,   (9)

where f(n) is called the communication function, indicating the number of communications required to communicate with all n processors; for n > 1, this function depends on the given interconnection structure.

The discussion that follows is focused on a few interconnection structures used to connect processing elements and to implement a parallel machine. The type of parallel machine under consideration consists of processing units connected among themselves and is, in general, without a common memory. The interconnection topologies considered in this paper are shown in Fig. 3.

The communication functions for these interconnection structures are shown in Table 1; the proof is presented in Appendix A.

The parallel execution time for each of the above interconnection topologies is as follows.

Fig. 3. Several interconnection structures under consideration.

Table 1
f(n) for various topologies

Topology       Communication function
Pipeline       f_L(n) = n(n − 1)/2
Binary tree    f_BT(n) = (1/ln 2)[(n + 1) ln(n + 1) − 2n ln 2]
Tree           f_bdL(n) = b(d + 1)d/2


3.1. Pipeline of n-processors

t_L(n) = (T + C f_L(n)) / n.   (10)

By differentiating the execution time of the pipeline topology with respect to n and setting the derivative to zero, we obtain the minimal execution time at the point

n_opt = sqrt(2T / C).
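As a numerical sanity check on Eq. (10) (with illustrative T and C values, not measurements from the paper), one can confirm that the integer processor count minimizing the pipeline time coincides with n_opt = sqrt(2T/C):

```python
import math

def t_pipeline(n, T, C):
    """Eq. (10): t_L(n) = (T + C*f_L(n))/n with f_L(n) = n(n-1)/2."""
    return (T + C * n * (n - 1) / 2) / n

T, C = 1000.0, 5.0            # hypothetical computation and overhead times
n_opt = math.sqrt(2 * T / C)  # analytic optimum for the pipeline
print(n_opt)                  # 20.0

# Brute-force check over integer processor counts.
best = min(range(1, 101), key=lambda n: t_pipeline(n, T, C))
print(best)                   # 20
```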

3.2. Binary tree

The number of communications required for a binary tree of d levels is

f_BT(d) = 2 Σ_{i=0}^{d} i k^{i−1} = Σ_{i=0}^{d} i 2^i = 2(2^d (d − 1) + 1),

where k = 2 for a binary tree. The number of processors for a d-level binary tree system is

n = Σ_{i=0}^{d} 2^i = 2^{d+1} − 1.

The execution time for a d-level binary-tree system with n processors is

t_BT(d) = (T + C f_BT(d)) / n = (T + C f_BT(d)) / (2^{d+1} − 1),

since

f_BT(n) = 2[((n + 1)/2)(ln((n + 1)/2) / ln 2 − 1) + 1] = (1/ln 2)[(n + 1) ln(n + 1) − 2n ln 2]

Fig. 4. The effect of T/C on speedup.


and

d/dn t_BT(n) = −T/n² + C (n − ln(n + 1)) / (n² ln 2).   (11)

For minimum t_BT(n), Eq. (11) should equal zero, giving

T/C = (n − ln(n + 1)) / ln 2.
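The chain of identities above can be verified numerically; the sketch below (our own check, not from the paper) confirms that the level-count form f_BT(d) = 2(2^d(d − 1) + 1) agrees exactly with the closed form in n whenever n = 2^{d+1} − 1:

```python
import math

def f_bt_level(d):
    """Exact communication count: sum_{i=0}^{d} i*2^i."""
    return sum(i * 2**i for i in range(d + 1))

def f_bt_n(n):
    """Closed form in n: (1/ln2)*[(n+1)*ln(n+1) - 2n*ln2]."""
    return ((n + 1) * math.log(n + 1) - 2 * n * math.log(2)) / math.log(2)

for d in range(1, 10):
    n = 2**(d + 1) - 1                           # processors in a d-level tree
    assert f_bt_level(d) == 2 * (2**d * (d - 1) + 1)
    assert abs(f_bt_n(n) - f_bt_level(d)) < 1e-6
```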

3.3. Tree

The execution time for an n-processor system, where n = bd + 1, is:

t_bdL(n) = (T + C f_bdL(n)) / n = (T + C b(d + 1)d/2) / (bd + 1).

For this case, the communication overhead is O(d²) for a particular b, but O(b) for a particular d. Hence, it is more advantageous to expand the breadth of the system rather than the depth in order to obtain a higher speedup.

From the above analysis, it is apparent that the computation to communication ratio is an important consideration for performance. The plots in Fig. 4 also suggest that, in order to maintain the speedup with an increasing number of processors, the computation to communication ratio must also be increased.
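The breadth-versus-depth observation can be read directly off f_bdL; in the small check below (our own sketch), doubling the breadth b doubles the communication cost, while the cost grows quadratically with the depth d:

```python
def f_tree(b, d):
    """Communication function for the b-branch, depth-d linear-array tree:
    f_bdL(n) = b*(d+1)*d/2 with n = b*d + 1 processors."""
    return b * (d + 1) * d / 2

# Doubling breadth doubles the communication cost (linear in b) ...
assert f_tree(2 * 3, 4) == 2 * f_tree(3, 4)
# ... while doubling depth scales it by (2d)(2d+1)/(d(d+1)), roughly 4x.
assert f_tree(3, 8) / f_tree(3, 4) == (8 * 9) / (4 * 5)
```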

4. Implementation on multi-transputer system

The parallel sparse matrix solving structure was implemented on the processor farm architecture connected in the pipeline form shown in Fig. 5. The task allocation of the program for the case when three processors are used to solve three submatrix equations is illustrated in Fig. 6.

Fig. 5. Farm architecture connected in pipeline form.


The controller of the system first reads in the input files consisting of the circuit matrices of the partitioned subcircuits. They are stored in the SPICE sparse matrix data structure in the program. Depending on the number of workers available in the system, the submatrices, along with the other data required to compute the results, are sent to the unoccupied workers. The workers then perform the LU decomposition (LUD) and forward–backward substitution (FBS) operations of the subcircuits concurrently. The subcircuit results are sent back to the controller via another set of channels. This procedure repeats until all submatrix equations have been solved by LUD and FBS. By combining the results obtained, the H matrix and w vector are then formed. Finally, the H matrix equation, Eq. (8), is solved in the controller; the interconnection solution is then substituted into Eq. (4), and the final sets of results are obtained.

Since the multi-transputer system is a message passing distributed memory system, each transputer cannot directly access memory located remotely, so data must be passed to the appropriate memory space through the transputer links. For every data transfer via an external link, there is an associated communication protocol overhead. In order to minimize this communication overhead, the elements in each sparse matrix are grouped into sets of arrays for data transfer. The controller converts the circuit matrix from the sparse data

Fig. 6. Task allocation of the matrix equation solving program.


form into a set of arrays, passes these through the communication link, and reforms them back into the sparse data structure at the destination workers. This affects the system when the computation to communication ratio is low. In some cases, the overhead generated by this conversion and reformation process can be over 40% of the overall simulation time. Therefore, this portion of the program must be optimized in order to reduce the overhead.
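The conversion step described above can be sketched as follows; the dictionary of (row, column) coordinates is a hypothetical stand-in for SPICE's linked-list sparse structure, used only to show the flatten, transfer and rebuild round trip:

```python
# Hypothetical sketch: a sparse matrix held as a dict {(row, col): value}
# stands in for SPICE's linked-list sparse structure.
def pack(sparse):
    """Flatten the structure into three flat arrays for one block transfer."""
    rows, cols, vals = [], [], []
    for (r, c), v in sorted(sparse.items()):
        rows.append(r); cols.append(c); vals.append(v)
    return rows, cols, vals

def unpack(rows, cols, vals):
    """Rebuild the sparse structure at the destination worker."""
    return {(r, c): v for r, c, v in zip(rows, cols, vals)}

m = {(0, 0): 2.0, (1, 2): -1.5, (2, 1): 0.5}
assert unpack(*pack(m)) == m
```

Sending three contiguous arrays amortizes the per-transfer protocol overhead over all matrix elements, at the price of the pack/unpack looping that the paper identifies as a significant computational overhead.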

5. System performance tuning

Two test circuits were chosen for evaluation purposes. Each circuit was partitioned into several subcircuits manually; the subcircuits were then read in by the original SPICE to generate individual submatrices. These submatrices were read in by the parallel SME solving program and solved by the decomposition algorithm mentioned above. The time required to solve the submatrix equations was recorded. This test was repeated using different numbers of transputers in the system, and the timings collected were used to evaluate the performance of the system under different conditions.

The first circuit is taken from [11]. The circuit consists of four pseudo op-amps, and the model of the pseudo op-amp is an artificial linear network consisting of 40 resistors. This avoids any convergence problem due to non-linear elements. Circuit II is similar to circuit I, but a different model of the pseudo op-amp is used: instead of a resistive network of forty resistors, the number of resistors in the resistive network is increased to 180. This results in an overall circuit matrix size of about 1600. After the tests were carried out, by observing the timings, the following remarks were drawn:

1. The conversion and reformation of the sparse data structure involve a great deal of looping and, consequently, incur high overheads.

2. The communication overhead associated with the system architecture is large, so it is necessary to reduce this overhead by changing the communication mechanism.

3. Miscellaneous computational overheads, such as initialization and dynamic memory allocation, must be minimized.

In order to achieve the enhancements mentioned previously, the following changes have been implemented:

• Changing the communication mechanism. As previously mentioned, one of the largest overheads is the regeneration of the sparse matrix, a process which can block the communication link and cause a degradation in overall performance. To overcome this problem, the new communication mechanism of the processor farm operates as follows: instead of computing the data in a `first-come-first-served' manner, a worker sends the received n sets of data on to its n neighbouring workers and computes the (n + 1)th set of data itself, as shown in Fig. 6. This mechanism avoids blocking the communication link during the regeneration of the data structure, and it also arranges for the regeneration of different sets of data structures to occur concurrently.


• Buffering the data. Concurrent processes, routers and mergers, in the workers are assigned to handle the data communication. To obtain optimum performance, an extra memory buffer is allocated at each router and merger.

• Reforming the sparse data structure at each worker involves two phases of work: (i) clearing the memory space used to store the submatrix previously assigned to the worker, and (ii) allocating new memory space for storing the matrix. The first phase involves a great deal of looping, which is extremely time consuming. The clearing and allocating processes are therefore combined to limit this overhead: the contents of the incoming sparse data structure are used to refill the memory space allocated for the previous computation. This speeds up the process dramatically.

• Since the sizes of the submatrices are not identical, the computation time for solving each submatrix is different. Although automatic load balancing among worker processors is achievable with the processor farm architecture, solving the submatrix equations in a different sequence may also produce a different overall computation time. The scheduling scheme employed in the program is based on the largest-processing-time (LPT) technique [12]. In brief, this scheduling strategy assigns the highest priority to the task whose execution time is largest; the execution time can be estimated from the size of each submatrix.

• Tuning the compiler parameters can affect the memory management scheme in a transputer. There are 4 Kbytes of on-chip internal memory in each T800 transputer which can be used to store the run-time stack. Programs whose stack size exceeds 4 Kbytes must use external memory instead, which results in a decrease in calculation speed. The size of the stack is optimized to ensure that it is placed in the on-chip memory, which improves performance.

• An alternative tree structure, as shown in Fig. 7, was also built. This topology was chosen in view of its higher degree of connectivity at the controller processor. A version of the SME solving program was implemented to use this topology, and its performance is compared with the pipeline version.

• When the workers are solving the SMEs concurrently, the controller is assigned to monitor and handle the inter-processor communications, and for some periods of time the controller may remain idle. In order to fully utilize all the physical resources in the system, the controller can be assigned to perform calculation work as well, especially when there is only a small number of processors. Since the controller can directly access its own memory space, where the original matrix elements are stored, this can eliminate the communication

Fig. 7. Tree connection topology.


overheads and the sparse matrix data structure conversion and reformation overheads associated with that calculation. The system was also implemented in this manner and is defined as configuration B.
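The LPT scheduling rule described in the previous section can be sketched as a greedy assignment to the least-loaded worker (our illustrative reading of the technique; the submatrix dimensions stand in for execution-time estimates):

```python
import heapq

def lpt_schedule(task_costs, n_workers):
    """Largest-processing-time: take tasks in decreasing cost order and
    always give the next task to the currently least-loaded worker."""
    heap = [(0.0, w) for w in range(n_workers)]   # (load, worker id)
    heapq.heapify(heap)
    assignment = {w: [] for w in range(n_workers)}
    for cost, task in sorted(((c, i) for i, c in enumerate(task_costs)),
                             reverse=True):
        load, w = heapq.heappop(heap)
        assignment[w].append(task)
        heapq.heappush(heap, (load + cost, w))
    makespan = max(load for load, _ in heap)
    return assignment, makespan

# Hypothetical submatrix sizes used as execution-time estimates.
costs = [404, 400, 400, 400]
assignment, makespan = lpt_schedule(costs, 3)
print(makespan)   # 800
```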

6. Results

After the tuning process described in the previous section, the test circuits were used again to further evaluate the system performance; the results are shown in Tables 2 and 3, in which the speedups are defined as follows:

1. Speedup due to the decomposition, S_d. This is the speedup obtained from the decomposition algorithm only, and can be calculated by

   S_d = t_1 / t_1d,   (12)

   where t_1 is the simulation time using one transputer without decomposition and t_1d is the simulation time using one transputer with decomposition. To assess the speedup offered by the decomposition algorithm only, the communication overheads have to be excluded from Eq. (12). This can be achieved by assigning all the computation to the controller processor (for both the case with decomposition and the case without). This is the condition under which the `simulation time' is recorded.

2. Speedup due to the multi-processor system, S_t′(N_t). This is the speedup obtained by adding more processors to the system. This value is equal to

   S_t′(N_t) = t_1d / t_Nd,   (13)

   where t_1d is the simulation time using one transputer with decomposition and t_Nd is the simulation time using N_t transputers with decomposition. Again, the numerator used

Table 2
Speedups obtained for pipeline connection topology and configuration B

             S_d    S_t(1)  S_t(2)  S_t(3)  S_t(4)  S_t(5)  S_o(1)  S_o(2)  S_o(3)  S_o(4)  S_o(5)
Circuit I    1.33   1.00    0.83    1.07    1.25    1.15    1.33    1.11    1.43    1.67    1.54
Circuit II   1.13   1.00    1.73    2.10    3.48    3.69    1.13    1.96    2.38    3.95    4.18

Table 3
Speedups obtained for tree connection topology and configuration B

             S_d    S_t(1)  S_t(2)  S_t(3)  S_t(4)  S_t(5)  S_o(1)  S_o(2)  S_o(3)  S_o(4)  S_o(5)
Circuit I    1.33   1.00    0.83    1.07    1.25    1.25    1.33    1.11    1.43    1.67    1.67
Circuit II   1.13   1.00    1.73    2.10    3.49    3.82    1.13    1.96    2.38    3.96    4.32


in Eq. (13) is the simulation time using the controller for all the computation. S_t(n) is defined as the speedup of a system consisting of n worker processors plus a controller, where the controller does not participate in the parallel computation work as the farm workers do (as in the case of Fig. 8a). S_t′(N_t) is defined as the speedup of a system consisting of (N_t − 1) worker processors plus a controller, where the controller also participates in the parallel computation work as the farm workers do (as in the case of Fig. 8b).

3. Overall speedup, S_o′(N_t). This is the speedup contributed by the decomposition algorithm as well as the multi-processor system. The value of the overall speedup can be obtained by

   S_o′(N_t) = S_d × S_t′ = t_1 / t_Nd.   (14)
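The three definitions chain together, since S_d · S_t′ = (t_1/t_1d)(t_1d/t_Nd) = t_1/t_Nd. A tiny sketch with illustrative timings (hypothetical numbers, not the measured values behind Tables 2 and 3) confirms the identity:

```python
# Speedup definitions from Eqs. (12)-(14), with illustrative timings.
t1  = 113.0   # one transputer, no decomposition (hypothetical)
t1d = 100.0   # one transputer, with decomposition
tNd = 27.1    # Nt transputers, with decomposition

Sd = t1 / t1d     # Eq. (12): gain from the decomposition alone
St = t1d / tNd    # Eq. (13): gain from adding processors
So = Sd * St      # Eq. (14): overall speedup
assert abs(So - t1 / tNd) < 1e-12   # the intermediate t1d cancels
```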

For a small multiprocessor system, there is always a question of whether the controller should be assigned to calculate submatrix equations. One program was implemented in such a manner that the controller is assigned calculation work when it is idle, waiting to send or receive. Conversely, another program was implemented so that the controller is not assigned

Fig. 8. Processes allocated to the controller in two different configurations.

Fig. 9. Speedups obtained from test circuit II under different conditions.


for calculating the submatrix equations. Fig. 8 shows the processes allocated to the controller in these two programs; the speedups obtained from the two programs were then compared.

The submatrices generated from circuit II were used as input; the circuit was partitioned into four subcircuits, and the sizes of the submatrices generated from the subcircuits are 404, 400, 400 and 400. Fig. 9 shows the speedup curves obtained from the two configurations (as shown in Fig. 8) with either the pipeline or the tree connection topology. The dotted lines represent the speedups obtained from configuration A, and the solid lines represent the speedups obtained from configuration B.

First, let us study the speedups obtained from configuration A. From the graph in Fig. 9, it can be seen that S_o obtained from the pipeline topology increases almost linearly with the number of processors, which is not as expected. Fig. 10 shows the task allocation for N_t = 3 (Fig. 10a) and N_t = 4 (Fig. 10b); the diagram illustrates why the simulation time and the speedups obtained for the two cases should be the same. The continuous increase in speedup between N_t = 3 and N_t = 4 shows that there are some overheads associated with the system which affect its performance when N_t = 3.

The curves show that the tree topology does not perform the same as the pipeline topology. In fact, the tree topology shows the expected effect in this problem: the speedup obtained when N_t = 3 is nearly equal to the speedup gained when N_t = 4. Since the only difference between the pipeline and the tree is the connection topology, yet there is a difference in actual performance, the overheads associated with the pipeline topology must be mainly communication overheads.

Now let us study the speedup factors obtained from configuration B. By observing the results, the difference between the two connection topologies is not as significant as in configuration A. The reason is that some of the calculations were assigned to the controller (see Fig. 10), which does not require inter-processor communication, and hence the communication overhead is reduced. As mentioned before, the difference in performance

Fig. 10. Task allocation on (a) two worker processors P1 and P2, (b) three worker processors P1, P2 and P3.


between the two topologies is due to communication overheads; therefore, in configuration B the difference is not as significant as in configuration A.

When the number of worker processors is equal to the number of subcircuits, the controller in configuration B only handles the inter-processor communication, which is equivalent to configuration A; there is then a difference in the speedups obtained between the two topologies, and in this situation the tree topology obtained a slightly better performance, as predicted.

Note that for the case when N_t = 3, the speedup from the tree topology in configuration A is better than that in configuration B, for the following reason. From Fig. 11, the task allocation diagram for configuration B shows that one of the worker processors was assigned to compute two submatrix equations, which implies that this worker processor finished its first calculation before the controller. Hence, for some period of time, the controller was required to handle both communication and computation, and had to perform context switching between the two. By assigning the calculation locally to the controller, the communication, conversion and reformation overheads can be eliminated; however, the overhead required for context switching is higher than the eliminated overheads in this case. Therefore, the overall speedup obtained when N_t = 3 is smaller than that in configuration A.

Fig. 11. Task allocation for the two configurations.


It is also worth mentioning that, after extensive testing, another factor affecting the speed of the system was found: when the allocated heap size in a transputer is increased, the speed of calculation on that transputer decreases. This fact is not mentioned in the transputer ANSI C manual, and is probably due to the hardware architecture of the transputer and/or the transputer C compiler. The speed reduction in the controller mentioned in the last paragraph may also be caused by this effect.

After studying the speedups obtained from these two configurations, it was found that when the number of worker processors is less than the number of subcircuits partitioned, the system performance is better for configuration B. However, if the number of processors in the system is substantially increased, it is better to dedicate the controller to handling only the inter-processor communications.

7. Discussion and conclusion

By observing all the other speedups obtained, the following factors affecting the performance of the multiprocessor system were found:

. It can be seen that, to obtain better performance, the controller can be assigned to solve the submatrix equation(s) as well. Usually, assigning the smallest submatrix (or submatrices) to the controller improves both the speedup and the efficiency of the system.

. As shown in the results, when all the subcircuits are of similar sizes, the maximum speedup is usually obtained when the total number of processors in the system (including the controller) is equal to the number of subcircuits; this is because all the processors are fully utilized and part of the communication and data reformation overheads have been eliminated. However, if there is a large number of worker processors in the system, or the computation to communication ratio is high, i.e. a long calculation time is required, a better speedup can be obtained if all the calculations are evenly assigned to the worker processors.

. For most of the cases, the speedups obtained from the pipeline do not differ significantly from those of the tree topology for the configuration where the controller is also assigned calculation work.

. In the first test circuit, the sizes of the subcircuits are not large; therefore the speedups achieved are not as good as those for circuit II. If the size of each subcircuit is large (e.g. the submatrix dimension is larger than 400), a speedup close to the total number of processors can be achieved.
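The load-balancing guideline above (assign the smallest submatrix to the controller and spread the rest evenly over the workers) can be illustrated with a simple greedy heuristic. The function name and the cost model below are illustrative assumptions, not taken from our implementation:

```python
def assign_submatrices(sizes, n_workers):
    """Greedy longest-processing-time assignment (illustrative sketch).

    The smallest submatrix goes to the controller; the remaining
    submatrices are placed, largest first, on the currently
    least-loaded worker.  Cost is modelled crudely as the submatrix
    dimension, standing in for LU factorisation time.
    """
    sizes = sorted(sizes, reverse=True)
    controller = sizes.pop()            # smallest submatrix -> controller
    workers = [[] for _ in range(n_workers)]
    for s in sizes:                     # largest first
        least = min(workers, key=sum)   # currently least-loaded worker
        least.append(s)
    return controller, workers
```

For example, `assign_submatrices([400, 390, 380, 120], 3)` gives the 120-row submatrix to the controller and one large submatrix to each worker, matching the evenly-loaded situation described above.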

By summarizing the results, the overheads and factors which directly affect the system performance in our implementation are:

. Communication overheads: these depend on the communication data size, the speed of the communication channel, and the traffic in the system.

. Computational overheads: these include the sparse data structure conversion and reformation overheads and the memory management overheads.


. Computation to communication ratio: if the data communication size is large and the time for computation is relatively small, the maximum speedup is limited no matter how many processors are available. This ratio is directly affected by the submatrix size and the complexity of the problem, which depends on the sparsity of the submatrix and the amount of fill-in during LU decomposition [13].

. System resources: these include the number of processors (e.g. transputers) in the system and the size of the memory available to each processor.

. Method of circuit partitioning: different partitioning methods may give different sizes of the generated submatrices, different numbers of submatrices, and different numbers of interconnections. The best speedup is obtained when the sizes of the submatrices are about the same, the number of submatrices is equal to the total number of processors in the system, and the number of interconnections among the subcircuits is small.

. Complexity of the problem. This includes:

1. the size of the circuit: in order to obtain a high speedup, the size of the problem, i.e. the size of the simulated circuit, must be large (e.g. the dimension of each generated submatrix is over 400);

2. the kind of analysis to be performed: whether only the real part of the results is required (e.g. dc operating point, dc transient analysis), or both the real and the imaginary parts of the results are of concern (e.g. ac analysis).

. System configuration and connection topologies: these concern the hardware as well as the structure of the parallel program. Both factors affect the performance of a multiprocessor system and can limit its maximum speedup.
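As a rough illustration of how the computation to communication ratio caps the speedup, consider a simple model (an assumption for illustration only, not the bound derived in this paper) in which the computation time parallelises perfectly over p processors while the communication time remains serial:

```python
def speedup_bound(t_comp, t_comm, p):
    """Upper bound on speedup under a crude illustrative model:
    computation t_comp divides evenly over p processors, while
    communication t_comm stays serial.  As p grows, the bound
    approaches (t_comp + t_comm) / t_comm, i.e. the speedup is
    capped by the computation-to-communication ratio."""
    return (t_comp + t_comm) / (t_comp / p + t_comm)
```

Under this toy model, a task with 90 units of computation and 10 of communication can never exceed a speedup of 10 regardless of the number of processors.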

Appendix A. Proofs of the communication functions

A.1. Pipeline of n processors

$$f_L(n) = 1 + 2 + 3 + \cdots + (n-1) = \sum_{j=1}^{n-1} j = \frac{(1 + (n-1))(n-1)}{2} = \frac{n(n-1)}{2}$$
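The closed form can be checked numerically against the direct sum; the following sketch (function names are illustrative, not from the paper) compares the two for small n:

```python
def f_pipeline_sum(n):
    # direct sum 1 + 2 + ... + (n - 1)
    return sum(range(1, n))

def f_pipeline_closed(n):
    # closed form n(n - 1)/2 from Appendix A.1
    return n * (n - 1) // 2
```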

A.2. Binary tree of d levels

The numbers of communications required for the different levels of a binary tree are given in Table 4.

Hence, the number of communications required for a binary tree of d levels is

$$f_{BT}(d) = 2\sum_{i=1}^{d} i\,k^{i-1} = \sum_{i=1}^{d} i \cdot 2^{i} = 2\bigl(2^{d}(d-1) + 1\bigr),$$

where k = 2 for a binary tree.
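Similarly, the binary tree closed form can be checked against the explicit sum (an illustrative sketch; the function names are assumptions):

```python
def f_bt_sum(d):
    # 2 * sum_{i=1}^{d} i * k^(i-1) with k = 2
    return 2 * sum(i * 2 ** (i - 1) for i in range(1, d + 1))

def f_bt_closed(d):
    # closed form 2(2^d (d - 1) + 1) from Appendix A.2
    return 2 * (2 ** d * (d - 1) + 1)
```

For d = 2 both give 2(1 + 2·2) = 10, matching the second row of Table 4.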


A.3. Tree

There are b linear arrays of dimension d, so the total number of processors is

$$n = bd + 1.$$

This is similar to the case of the pipeline topology, i.e.,

$$f_T(n) = \frac{b(d+1)d}{2}.$$
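Since each of the b branches together with the controller forms a pipeline of d + 1 processors, the tree count should equal b times the pipeline count; a quick numerical check (with illustrative function names):

```python
def f_pipeline(n):
    # pipeline of n processors: n(n - 1)/2 communications (Appendix A.1)
    return n * (n - 1) // 2

def f_tree(b, d):
    # b linear arrays of dimension d attached to one controller:
    # b * (d + 1) * d / 2 communications (Appendix A.3)
    return b * (d + 1) * d // 2
```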

References

[1] Nagel LW. SPICE2: a computer program to simulate semiconductor circuits. Electronics Research Laboratory, Report No. ERL-M520, University of California, Berkeley, 1975.

[2] Weeks WT, Jimenez AJ, Mahoney GW, Mehta D, Qassemzadeh H, Scott TR. Algorithms for ASTAP: a network analysis program. IEEE Trans. Circuit Theory 1973;CT-20(6):628-34.

[3] Yang P, Hajj IN, Trick TN. SLATE: a circuit simulation program with latency exploitation and node tearing. Proc. IEEE Int. Conf. on Circuits and Computers, Oct. 1980.

[4] Lelarasmee E, Ruehli AE, Sangiovanni-Vincentelli AL. The waveform relaxation method for time domain analysis of large scale integrated circuits. IEEE Trans. CAD of IC Systems 1982;1(3):131-45.

[5] White J, Sangiovanni-Vincentelli A. Relaxation Techniques for the Simulation of VLSI Circuits. Norwell, MA: Kluwer Academic Publishers, 1986.

[6] Saleh RA et al. Parallel circuit simulation on supercomputers. Proc. IEEE 1989;77(12):1915-31.

[7] Nakata T et al. CENJU: a multiprocessor system for modular circuit simulation. Computing Systems in Engineering 1990;1(1):101-9.

[8] Cox PF et al. Direct circuit simulation algorithms for parallel processing. IEEE Trans. Computer-Aided Design 1991;10(6):714-25.

[9] Chen RMM. Solving a class of large sparse linear systems of equations by partitioning. Proc. IEEE Int. Symp. on Circuit Theory, Toronto, 1973, pp. 223-226.

[10] Layfield AM, Chen RMM, Ng PKH. Multiprocessor simulator evaluation of a circuit partitioning algorithm for parallel execution of SPICE. Proc. 1992 European Simulation Multiconference, 1992, pp. 440-444.

[11] Layfield AM, Chen RMM, Lim LS, Siu WC. Implementation of multi-processor SPICE. Proc. 6th Int. Forum on CAD, EUROTEAM, University of Leicester, 1991, pp. 316-325.

[12] Coffman EG Jr., Denning PJ. Operating Systems Theory. Prentice Hall, 1973.

[13] Gallivan KA et al. Parallel Algorithms for Matrix Computations. SIAM, 1990.

Table 4
No. of communication hops for the binary tree topology

Levels   No. of communications
1        2
2        2(1 + 2·2)
3        2(1 + 2·2 + 3·4)
4        2(1 + 2·2 + 3·4 + 4·8)
...      ...
d-1      2[1 + 2·2 + 3·4 + 4·8 + ... + (d-1)·2^(d-2)]
d        2[1 + 2·2 + 3·4 + 4·8 + ... + d·2^(d-1)]


Authors' biographies

Calvin K.Y. Wu obtained his degree in Electrical and Electronic Engineering from the University of Portsmouth, England in 1990. He is currently a member of the technical staff in the Department of Electronic Engineering, City University of Hong Kong, responsible for computer systems and networks. E-mail: [email protected].

Paul K.H. Ng received the M.Phil. degree in Electronic Engineering in 1994 from the City University of Hong Kong. He is currently working at the Motorola Australia Software Centre in the area of ASIC design.


Jia Xingdong received the B.Sc. degree in 1983 and the M.Sc. degree in 1986, both in electrical engineering, from Xi'an Jiaotong University. From 1986 to 1992, he was a lecturer in the Electrical Engineering Institute at Xi'an Jiaotong University. Mr. Jia joined the Department of Electronic Engineering at City University of Hong Kong as a research associate in 1992 and will receive the Ph.D. degree from the same university. His major research interests are circuit simulation, mixed-signal VLSI and parallel processing.

Richard Ming-Ming Chen graduated from Jiao Tong University in 1959, received his Master of Engineering degree from Pratt Institute, New York in 1965 and his Ph.D. degree from Rutgers, the State University of New Jersey, in 1968, all in electrical engineering. He was with Bell Telephone Laboratories, USA from 1968 to 1976. Dr. Chen is currently the Associate Head of the Department of Electronic Engineering at City University of Hong Kong. His current research interests are in various aspects of computer-aided analysis and design. He is a senior member of the IEEE, a fellow of HKAAST and a member of Eta Kappa Nu and Sigma Xi.


Andrew Martin Layfield graduated from the University of Sussex in 1977 with a degree in applied physics, obtained an M.Sc. in experimental space physics at the University of Leicester in 1978 and his Ph.D. at the University of Hull in 1984. He is currently an Associate Professor of Electronic Engineering at City University of Hong Kong. His current research interests include computer system design, simulation, computer graphics and parallel processing.
