ECSE 548 - Electronic Design and Implementation of the Sine function on 8-bit MIPS processor - Report

Design and implementation of the sin function for an 8-bit MIPS processor

Dominik Laskowski, Payom Meshgin, Daniel Ranga, Ming Yang

I. MOTIVATION / BACKGROUND

Hardware acceleration of transcendental functions is crucial for real-time systems performing computationally intensive tasks like computer graphics and audio processing. The x87 floating-point unit in the IA-32 architecture supports instructions like fsin and fsincos. Likewise, graphics processing units and digital signal processors provide dedicated logic for trigonometric operations.

Lookup tables based on ROMs and PLAs are the most common approach for implementing sin in hardware. They are often coupled with approximation techniques like interpolation to achieve adequate precision while reducing transistor count. An alternative method that offers better precision at the expense of speed is the Taylor series expansion. Finally, the CORDIC (Coordinate Rotation Digital Computer) algorithm is desirable in embedded systems without a multiplier, since it only requires adders, shifters and lookup tables. In this report, we investigate the design and implementation of a sin functional block for an 8-bit MIPS processor, using the first two aforementioned approaches.

In practice, sin blocks usually operate on 32-bit IEEE 754 floating-point numbers. However, since the MIPS core has 8-bit registers and integer operations only, we opted for a custom encoding scheme based on binary scaling. The assumed domain and image are [0, π/2] and [0, 1], respectively. Each range is discretized to an integer between 0 and 255.
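The encoding described above can be sketched in Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and the choice of 255 (rather than 256) as the full-scale code is an assumption made so that π/2 and 1.0 both fit in 8 bits.

```python
import math

FULL_SCALE = 255  # assumption: code 255 stands for pi/2 (input) or 1.0 (output)

def encode_angle(theta):
    """Map an angle in [0, pi/2] radians to an 8-bit code."""
    if not 0.0 <= theta <= math.pi / 2:
        raise ValueError("angle outside [0, pi/2]")
    return round(theta / (math.pi / 2) * FULL_SCALE)

def decode_angle(code):
    """Map an 8-bit code back to an angle in radians."""
    return code / FULL_SCALE * (math.pi / 2)

def encode_value(y):
    """Map a sine value in [0, 1] to an 8-bit code."""
    return round(y * FULL_SCALE)

def sin8(code):
    """Reference model of the block: 8-bit code in, 8-bit code out."""
    return encode_value(math.sin(decode_angle(code)))
```

Under these assumptions, a programmer would call encode_angle before issuing the sin instruction and divide the returned code by 255 to recover a value in [0, 1].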

II. DESIGN IMPLEMENTATION

A. Lookup Table Implementation

The sin lookup table is a NOR-NOR PLA with 8-bit input and output. The design process for this implementation was straightforward. The decimal and binary representations of the 256 angles and sin results were calculated in an Excel spreadsheet and exported to a CSV file. A simple Python script was written to convert the CSV file into a Verilog casez statement. The schematic and layout, shown in Figure 1, were obtained from the PLA generator and tweaked in Electric to pass DRC. Figure 2 is the final layout after the lookup table was mirrored vertically and wired up to the datapath.
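A table-to-Verilog conversion of this kind might look like the sketch below. It is not the authors' script: it computes the table directly instead of reading the Excel-exported CSV, and the signal names a and y, the helper names, and the full-scale code of 255 are all assumptions.

```python
import math

def sin_table(bits=8):
    """Rounded sine codes for every input code (full scale assumed 2**bits - 1)."""
    full = (1 << bits) - 1
    return [round(math.sin(i / full * math.pi / 2) * full)
            for i in range(1 << bits)]

def to_casez(table, bits=8, sel="a", out="y"):
    """Emit a Verilog casez statement with one arm per table entry."""
    lines = ["always @(*)", f"  casez ({sel})"]
    for i, v in enumerate(table):
        lines.append(f"    {bits}'b{i:0{bits}b}: {out} = {bits}'b{v:0{bits}b};")
    # Unreachable for a full table, but keeps the block purely combinational.
    lines.append(f"    default: {out} = {bits}'b{'x' * bits};")
    lines.append("  endcase")
    return "\n".join(lines)

if __name__ == "__main__":
    print(to_casez(sin_table()))
```

The printed text can be pasted into a Verilog module body and handed to a PLA generator or simulator.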

B. Taylor expansion design

$\sin(x) = x - \frac{x^3}{3!} + \frac{x^5}{5!} + O(x^7)$ (1)

Equation 1 is the three-term Taylor series expansion of sin(x). In this project, we implemented a combinational logic circuit based on this equation to evaluate the sin function. We chose this approach because the goal of the project is a sin function generator whose accuracy is comparable to that of the LUT solution, and a Taylor series offers adjustable accuracy through the number of terms included in the evaluation.

Fig. 1. Schematic and layout of the lookup table

Fig. 2. Layout of MIPS core with lookup table

During the design phase of this circuit, as shown in Fig. 3, a top-down design approach was followed to implement the system-level, gate-level and transistor-level design of the entire system. Because Simulink uses the same interface for circuit design and simulation, designing there is much more efficient than designing in Electric and then verifying in ModelSim. Therefore, the system-level and gate-level design and simulation were both carried out in Simulink, and the design was translated directly into Electric for the transistor-level implementation once the gate-level functionality had been verified.

Fig. 3. Circuit Design Methodology

Fig. 4. Absolute approximation error with different Taylor series

At the beginning of the project, an error estimation was conducted to identify the target of this design, so that the corresponding circuit could be built to specific design requirements. Fig. 4 shows the absolute approximation error of the Taylor series under different conditions. Since the design is required to have accuracy similar to that of the LUT implementation (error of about 0.2%), we decided to use a three-term Taylor series expanded about zero, which ensures that the estimation error is smaller than 0.5%.
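The truncation error of the expansion about zero can be checked numerically with a short sketch such as the one below. The function names and the 256-point sampling grid are assumptions for illustration; floating-point arithmetic is used, so fixed-point effects are not included.

```python
import math

def taylor_sin(x, terms):
    """Maclaurin series of sin(x) truncated to the given number of terms."""
    return sum((-1) ** k * x ** (2 * k + 1) / math.factorial(2 * k + 1)
               for k in range(terms))

def max_abs_error(terms, samples=256):
    """Largest absolute error over evenly spaced points in [0, pi/2]."""
    xs = [i / (samples - 1) * math.pi / 2 for i in range(samples)]
    return max(abs(taylor_sin(x, terms) - math.sin(x)) for x in xs)

if __name__ == "__main__":
    for terms in (2, 3, 4):
        print(terms, max_abs_error(terms))
```

With three terms the worst case occurs at x = π/2, where the series evaluates to about 1.0045, which is consistent with the 0.5% bound quoted above.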

Fig. 5. Finalized system-level design of Taylor series implementation

Fig. 5 illustrates the finalized system-level design of the Taylor series implementation. Several optimizations were made at this level. First, some of the intermediate values, such as x^2 and x^3/6, are reused to reduce the complexity of the circuit. Second, division is carried out by multipliers, since the divisor is always larger than one. The cost can be reduced further by a shift operation if the divisor contains a factor of 2. To keep the design consistent with our lookup table implementation, we initially intended to encode all fixed-point numbers by shifting 8 bits. However, this leads to an error when the input value becomes larger than 1 radian (about 57.3 degrees). Therefore, as shown in the figure, a multiplier is placed at the front end of the circuit so that the input value can be encoded by shifting 6 bits to the right, and the fixed-point precision is reduced by 2 bits.
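The data flow described above can be mimicked with integer arithmetic as in the following sketch. It is only an illustration under several assumptions: the reciprocal constants are rounded to 6 fractional bits, every product is truncated by a plain shift, the input code uses 255 as full scale, and all names are hypothetical. It is not meant to reproduce the reported error figures, since the actual circuit may keep more intermediate bits.

```python
import math

FRAC = 6                  # fractional bits left after the front-end scaling
ONE = 1 << FRAC           # 1.0 in this fixed-point format

def mul(a, b):
    """Fixed-point multiply; the shift truncates the low-order bits."""
    return (a * b) >> FRAC

# Division by a constant becomes multiplication by its reciprocal.
INV_6 = round(ONE / 6)    # assumed constant: 1/6 is stored as 11/64
INV_20 = round(ONE / 20)  # assumed constant: 1/20 is stored as 3/64

def taylor_sin_q6(code):
    """Three-term Taylor evaluation for an 8-bit input code."""
    # Front-end multiplier: convert the code to radians with 6 fractional bits.
    x = round(code / 255 * (math.pi / 2) * ONE)
    x2 = mul(x, x)                  # x^2, reused by both higher-order terms
    t3 = mul(mul(x2, x), INV_6)     # x^3 / 6
    t5 = mul(mul(t3, x2), INV_20)   # x^5 / 120 = (x^3 / 6) * x^2 / 20
    return x - t3 + t5              # result with 6 fractional bits
```

For example, taylor_sin_q6(255) returns 63, i.e. 63/64 ≈ 0.984, which shows how the 6-bit format and the coarse constants limit the accuracy of this particular sketch.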

Fig. 6. Finalized gate-level design of Taylor series implementation

In the schematic design, based on the information provided in the textbook, an 8-bit Ladner-Fischer adder and carry-save adders are used in the multiplier to reduce the propagation delay on the critical path. A mirrored version of the carry-save adder is also used so that the multiplier takes an organized rectangular form in layout. Fig. 6 shows the finalized gate-level schematic of the overall system in Electric.

Fig. 7. Finalized transistor-level layout design of multiplier

The transistor-level implementation is based on the validated schematic design described above. Fig. 7 illustrates the layout of the 8-bit multiplier. The carry-save adder section is organized in a rectangular shape, and a Ladner-Fischer adder is placed at the bottom to sum the carry bits. The entire system is organized in a similar way, and because of the methodology followed in this design, the simulation of the system succeeded on the first attempt.

C. Modification to MIPS processor

A conservative method was used to connect the generated functional block to the available MIPS core. For the purpose of integration, the functional blocks are 8-bit input, 8-bit output black boxes that act as an extended arithmetic operation. As such, the sin operation is treated as an R-type instruction, operating in parallel with the present ALU (the second source register is ignored). The function code assigned to sin is 101111. This approach requires modifying the ALU decoder while keeping the MIPS state controller intact. The ALU decoder is extended by another output:

$ALUctrl[3] = ALUop[1] \cdot funct[0] \cdot funct[2] \cdot funct[3]$

ALUctrl[3] is used as the select signal of a 2:1 wordslice multiplexer, which outputs either the result generated by the ALU, when performing an ALU task, or the result from the sin functional block. While the equation for ALUctrl[3] may not be optimal given the signals available within the ALU decoder, the extra logic is limiting neither in terms of local routing nor in terms of total area. Finally, the ALU and the circuits to its right had to be moved to the right to allow insertion of the mux wordslice and to permit routing of two 8-bit busses (srcA and result), as the sin block could not be implemented within the row constraints set by the datapath.
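The select logic can be checked with a small truth-table sketch. It assumes that the terms of the ALUctrl[3] equation are ANDed together, that ALUop = 10 marks R-type instructions, and that the baseline core uses the usual MIPS function codes for add, sub, and, or and slt; none of these details is confirmed by the text beyond the sin function code itself.

```python
SIN_FUNCT = 0b101111  # function code assigned to sin in the text

# Assumed baseline R-type function codes (standard MIPS encodings).
BASE_FUNCTS = {"add": 0b100000, "sub": 0b100010, "and": 0b100100,
               "or": 0b100101, "slt": 0b101010}

R_TYPE = 0b10  # assumed ALUop value for R-type instructions

def bit(value, index):
    """Extract a single bit of value."""
    return (value >> index) & 1

def aluctrl3(aluop, funct):
    """Mux select: 1 routes the sin block's output to the result bus."""
    return bit(aluop, 1) & bit(funct, 0) & bit(funct, 2) & bit(funct, 3)

if __name__ == "__main__":
    print("sin", aluctrl3(R_TYPE, SIN_FUNCT))
    for name, funct in BASE_FUNCTS.items():
        print(name, aluctrl3(R_TYPE, funct))
```

Under these assumptions only the sin function code raises the select signal, so the existing ALU instructions keep their original data path.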

III. VALIDATION

To verify the correct functionality of the design, each component was verified before being included in the global design.

A. Look-up table design

For the lookup table implementation of the functional block, a Verilog module was written that generates the expected output for each possible input to the block. A Verilog deck was generated from the layout of the design, and the outputs of both modules were compared directly to ensure that the results match.

B. Taylor expansion design

As for the Taylor expansion design, the outputs were compared with the results of the theoretical Taylor expansion. In simulation, the results were matched against those from the Simulink model.

C. Modification to MIPS processor

• Ran the same testbench from the labs
• Wrote a MIPS assembler to help write small MIPS programs
• Checked that the modifications did not affect the operation of other MIPS instructions
• Checked that the core accepts the sin instruction
• In ModelSim, used a Verilog model to verify correct operation of the MIPS processor

IV. RESULTS & EVALUATION

The two implementations of the design were evaluated using the following three metrics:

• Accuracy (error between the output of the functional block and the theoretical evaluation of the sine function)
• Size (number of transistors in the layout of the functional block)
• Scalability (how the complexity of the design scales as the bit-width of the signals increases)

A. Results & Accuracy

The accuracy of the evaluation is very important for functional blocks such as our sine block. Naturally, the evaluation of the block at any input is designed to match the theoretical result of the sine function at the discrete points defining the input domain of the block. Fig. 8 shows how closely both implementations of the sine block follow the theoretical value of the function over the entire input domain.

Fig. 8. Evaluations of the sine function using both implementations and the theoretical value

Fig. 9 gives a more precise view of the accuracy. From the plot, it is evident that the lookup-table-based implementation is extremely accurate. This is not a surprise, since the lookup table is designed such that, for a given input, the block outputs the value closest to the theoretical sine function. Hence, the absolute error of the LUT design never exceeds the error due to quantization, which for this digital design is equal to $2^{-9} \approx 0.002$.

As for the Taylor expansion design, the error is greater than that of the LUT implementation of the sine block. This behaviour is explained by the loss of precision incurred while evaluating the Taylor series terms. The dominant error term is hence $2^{-7}$, due to the reduction of the fixed-point precision to 6 bits. In addition, some of the error, especially near the end of the input range, is caused by the Taylor series expansion itself, namely the fact that only its three most dominant terms are computed.
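The quantization figures quoted above correspond to half of one least-significant bit: $2^{-9}$ for 8 fractional bits and $2^{-7}$ for 6. The sketch below checks the LUT case numerically. It assumes a full-scale code of 255 and round-to-nearest table entries, and the function name is hypothetical.

```python
import math

def lut_max_error(bits=8):
    """Worst-case absolute error of a round-to-nearest sine table."""
    full = (1 << bits) - 1   # assumption: the largest code represents 1.0
    worst = 0.0
    for code in range(1 << bits):
        exact = math.sin(code / full * math.pi / 2)
        stored = round(exact * full) / full   # value the table would hold
        worst = max(worst, abs(stored - exact))
    return worst

if __name__ == "__main__":
    print(lut_max_error(), 2 ** -9, 2 ** -7)
```

With a full scale of 255 the half-step is 1/510, which is within a fraction of a percent of $2^{-9}$.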

B. Size

The areas of both implementations of the block were measured as follows:

• LUT: $3200 \times 300 = 0.96 \times 10^6 \lambda^2$
• Taylor: $7000 \times 4700 = 32.9 \times 10^6 \lambda^2$

The lookup table design is significantly smaller than the Taylor expansion design, both in area and in number of transistors. Indeed, the Taylor implementation of the functional block dwarfs the entire MIPS core in size, as seen in the image of the integrated design (Fig. 10).

Fig. 9. Plot of the absolute error for each evaluation of either implementation of the sine block

Fig. 10. Overall layout design of the integrated system including the Taylor expansion implementation and the MIPS core.

C. Scalability

The scalability of each implementation differs. Given n as the bit-width of the operand (n = 8 bits in this case), the LUT-based implementation scales quadratically as the precision of the operation increases, so the complexity of the design is on the order of $O(n^2)$. The Taylor-expansion-based design, however, scales linearly, which leads to a smaller difference between the sizes of the two implementations as n grows.

V. POSSIBLE IMPROVEMENTS

There are a number of improvements to the current design that would improve the functionality of the sine block. Most importantly, linear interpolation could be employed in the LUT design, which would greatly reduce area and number of transistors at the expense of a slightly larger error. This modification, involving the use of an additional n-bit wide adder, is essential for larger-precision environments (32-bit, 64-bit).
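A behavioural sketch of an interpolated table is given below. The segment count of 16, the rounding scheme, and all names are assumptions chosen for illustration; the text does not specify how the interpolation would be organized.

```python
import math

COARSE_BITS = 4                  # hypothetical: 16 segments instead of 256 entries
STEP = 1 << (8 - COARSE_BITS)    # input codes covered by one segment

def exact8(code):
    """Full-table reference value (full scale assumed to be 255)."""
    return round(math.sin(min(code, 255) / 255 * math.pi / 2) * 255)

# One extra node so that the last segment has an upper endpoint.
NODES = [exact8(i * STEP) for i in range((1 << COARSE_BITS) + 1)]

def sin8_interp(code):
    """Look up the segment endpoints and interpolate linearly between them."""
    idx, frac = divmod(code, STEP)
    lo, hi = NODES[idx], NODES[idx + 1]
    # Adding STEP // 2 before dividing rounds to the nearest integer.
    return lo + ((hi - lo) * frac + STEP // 2) // STEP

if __name__ == "__main__":
    worst = max(abs(sin8_interp(c) - exact8(c)) for c in range(256))
    print(len(NODES), "stored entries, worst difference:", worst)
```

In this sketch the table shrinks from 256 to 17 stored values, at the cost of one small multiply and the additional adder mentioned above.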

Another improvement would be to extend the functionality of the block. For example, the sine block could be extended such that the input angle can lie within the range 0° to 180°, or within the entire possible range of angles. Moreover, other trigonometric functions could be implemented with minor modifications to the current design, such as the cosine (using an adder) and tangent (using a divider) functions.
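These extensions follow from the identities cos(x) = sin(π/2 − x) and sin(π − x) = sin(x). A minimal sketch, assuming a full-scale code of 255 and a hypothetical extra quadrant bit for the wider input range:

```python
import math

def sin8(code):
    """Reference first-quadrant block: codes 0..255 span [0, pi/2]."""
    return round(math.sin(code / 255 * math.pi / 2) * 255)

def cos8(code):
    """cos(x) = sin(pi/2 - x): one subtraction ahead of the sine block."""
    return sin8(255 - code)

def sin_0_to_180(quadrant, code):
    """sin(pi - x) = sin(x): mirror the code when in the second quadrant.

    quadrant is a hypothetical extra input bit: 0 selects [0, pi/2],
    1 selects [pi/2, pi], with code measuring the offset into the quadrant.
    """
    return sin8(code) if quadrant == 0 else sin8(255 - code)
```

In both cases the only added hardware is the subtraction 255 − code, which is a bitwise inversion for 8-bit codes.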

Finally, the MIPS core could also be extended to improve the performance of the functional block. For instance, the MIPS core could be modified to enable multicycle evaluation, using the ALU to compute each of the additions needed to evaluate a Taylor expansion. This modification would lead to greater savings in area and in the number of transistors.

If the LUT can be rotated 90°, the LUT-integrated MIPS core would be small enough vertically to package (horizontally, the package may require the addition of pins or an increase in pin spacing). While the system does work, it is highly unlikely that the layout can be manufactured, due to a significant number of long parallel wires, especially those tying the functional block to the datapath. Additionally, no attempt was made to scale the gates driving these long routing wires.

As implemented, the MIPS core could accept many more functional blocks similar to sin. Each function would require its own function code and an input on the wordslice mux, and the ALU control encoding would have to be extended.

Finally, while the functional implementation operates on an 8-bit number, it is possible to extend the input to 16 bits, as both srcA and srcB are available as inputs to the ALU. It would also be possible to extend the output to 16 bits; however, this would require modifying the core state machine, as it would need two write cycles instead of the single write cycle that R-type operations allow.

VI. CONCLUSION

In conclusion, two methods to compute the sin function have been implemented, integrated into the MIPS core and functionally validated: one using a lookup table PLA, the other using a Taylor series expansion. The block's operation relies on a custom input and output encoding, which requires the programmer to validate that the input data is as intended and to perform post-processing operations on the output as required. For an 8-bit wide implementation, the LUT implementation performs best while maintaining the smallest area; however, as bus width increases, the Taylor series implementation becomes more advantageous area-wise. Still, in that particular case, a lookup-table-based model with linear interpolation is likely superior to any Taylor series expansion.
