Processor Report_ECE 174


  • 8/10/2019 Processor Report_ECE 174

    1/34

    California State University, Fresno

    Final Project Report

    Pipelined Processor

    Author:

    Abhijit Suprem

    Instructor:

    Dr. Tarek Elarabi

    December 11, 2014


    Contents

    1 Statement of Objectives
        1.1 Processor Design Objectives
    2 Background
        2.1 Verilog - Language and Workflow
        2.2 Software Tools
        2.3 Design Overview - Processor
        2.4 MIPS Architecture
        2.5 Pipelined Architecture
    3 Pipelining
        3.1 ALU
            3.1.1 Logic and Shift Operations
            3.1.2 Arithmetic Operations
    4 Verilog Code for Processor Units
        4.1 Instruction Fetch
        4.2 Instruction Memory
        4.3 Instruction Decode
        4.4 Registers
        4.5 Execute
        4.6 Stack
        4.7 Data Memory
        4.8 Write Back
        4.9 Processor
    5 Testbench Procedures
        5.1 Testbench for Processor
        5.2 Programming the Processor
            5.2.1 Arithmetic program
            5.2.2 Load/Store Program
            5.2.3 Bubble Sort
    6 Conclusions
    7 Appendix A
        7.1 Instruction Decode
        7.2 Stack

    List of Figures

    2.1 MIPS Pipeline block diagram
    3.1 Pipelined processor block diagram
    3.2 Format for arithmetic, branching, and data memory instructions
    3.3 Prototypical module for logical and shift operations
    5.1 Simulation of Arithmetic operations
    5.2 Simulation of Load/Store operations
    5.3 Sorting: Worst Case
    5.4 Sorting: Realistic


    1. Statement of Objectives

    The objective of this final project is to conceptualize, design, and validate a pipelined processor based on the MIPS architecture, but with simplifying modifications. An Arithmetic Logic Unit (ALU) with various arithmetic and logical operations is designed in conjunction. The processor is built from the ground up in Verilog HDL with the Quartus II development environment and simulated under the ModelSim-Altera RTL Simulation Platform.

    Included in this report are the following:

    - Overview of the processor through a block diagram
    - Detailed description of each of the processor stages and their implementation in Verilog
    - Simulation results of the processor with a few programs:
        - Arithmetic program that performs operations
        - Basic Load/Store program demonstrating memory accesses
        - 8-number bubble sort combining these operations
    - Discussion of future work and possible improvements

    1.1 Processor Design Objectives

    The processor is designed under 8-bit register/data considerations. The instructions are 32 bits long. They are divided into four 8-bit components (see Table 1).

    Table 1: Instruction Length

    Bits     Name                 Function
    8 bits   OPCODE               Operation code
    8 bits   REG DEST, PC NEXT    Address of destination register, or address of next instruction (branching)
    8 bits   REG 1, MEM           Address of 1st register, or memory address
    8 bits   REG 2, LITERAL       Address of 2nd register, or literal value

    The processor will have the standard 5-stage pipeline with Instruction Fetch, Instruction Decode, Execute, Data Memory Access, and Write Back. Each of these will be discussed in Section 3. For the processor, the instruction set is defined first. The instructions are categorized into four groups: arithmetic instructions (23 instructions), branch instructions (4 instructions), memory access instructions (4 instructions), and no-operation instructions (1 instruction).
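
    The field layout in Table 1 maps directly onto bit slices of the 32-bit instruction word. The following is a minimal sketch of that split, not the report's own code; the module name InstructionFields and its port names are hypothetical and chosen only for illustration:

    ```verilog
    // Sketch: split a 32-bit instruction into the four 8-bit fields of Table 1.
    // Module and port names are illustrative assumptions, not from the report.
    module InstructionFields (
        input  wire [31:0] instruction,
        output wire [7:0]  opcode,   // bits 31..24: operation code
        output wire [7:0]  dest,     // bits 23..16: REG DEST or PC NEXT
        output wire [7:0]  field1,   // bits 15..8 : REG 1 or MEM
        output wire [7:0]  field2    // bits 7..0  : REG 2 or LITERAL
    );
        // The concatenation on the left is filled from the most significant byte down.
        assign {opcode, dest, field1, field2} = instruction;
    endmodule
    ```

    For the instruction word 32'h01030403, this split gives opcode 8'h01, dest 8'h03, field1 8'h04, and field2 8'h03.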

    2. Background

    2.1 Verilog - Language and Workflow

    Verilog is a standardized (IEEE 1364) Hardware Description Language used for modeling electronic systems. Hardware Description Languages (HDLs) are used for various reasons, including readability, earlier simulations, feasibility studies, and abstraction.

    The Verilog design flow contains the following steps (summarized):

    1. Specification

    2. HDL Coding

    3. Synthesis

    4. Place and Route

    5. Timing Analysis


    6. FPGA implementation

    This project will include the first five steps, plus ModelSim simulation in lieu of FPGA implementation. Specification entails conceptualizing the problem and determining I/O pins and other requirements such as timing constraints. HDL Coding is a high-level, RTL abstraction of the design that can then be synthesized. Synthesis entails converting the HDL code into a bit-code that can be programmed into the Field Programmable Gate Array (FPGA). Placing and Routing involves selecting the actual location of placement on the FPGA chip. The final, synthesized design can then be tested by software simulation through ModelSim or other third-party tools. The testing phase may include Formal Verification and/or Assertion Verification in order to validate the module's function.

    2.2 Software Tools

    The following tools were used for this project: (i) Quartus II and (ii) ModelSim-Altera. A brief description of each follows.

    Quartus II Quartus II is produced by Altera Corporation. It is an FPGA platform for logic design and synthesis for implementation. It is used to program the modules and view the resultant RTL circuits, Post-Map Netlist, and State Machine Diagram, if it exists. Quartus also provides various hardware synthesizers to simulate real-world implementation. Quartus II also performs the synthesis functions by converting the HDL code to FPGA bit-code.

    ModelSim-Altera ModelSim-Altera is the simulation tool for validating the synthesized designs from Quartus II. Timing information from Quartus II is used to run the simulation. ModelSim can be used to run standalone simulations as well.

    2.3 Design Overview - Processor

    The processor, also known as the central processing unit, is the circuit that carries out the arithmetic, I/O, and control operations as specified by a user. The term has been used for over fifty years, and yet the definition has not changed. The processor today contains two basic units: an arithmetic logic unit to perform operations and a control unit to fetch instructions and execute them. A processor can also contain peripherals such as registers to temporarily store data for faster access, cache to store recently used data, a data memory unit to store long-term data, and an I/O interface to communicate with human operators or the environment and to deliver information.

    2.4 MIPS Architecture

    A processor is only useful with an instruction set: the set of instructions the processor can perform. A simple architecture useful for rudimentary processor design is the MIPS architecture, a RISC architecture widely used in academia. It has a simple instruction set, which has been further simplified and reduced for this project. The MIPS architecture also takes advantage of pipelining. The generalized five-stage pipeline is shown in Figure 2.1.

    2.5 Pipelined Architecture

    To understand a pipelined architecture, it is necessary to first examine a non-pipelined architecture. In such an architecture, a processor executes one instruction each clock cycle. Thus, future instructions wait for the current instruction to be finished before they can continue. A processor can significantly increase efficiency by splitting an instruction into several stages and executing each separately. If a processor's execution is split into five stages, then in the ideal case, each stage will need 1/5 of the original clock cycle (in a real scenario, different stages will take different amounts of time and the clock will be based on the slowest stage). As such, the processor can begin the next instruction when the first stage of the current instruction is completed. Table 2 shows an example. As seen, in the first clock cycle, the first stage of the first instruction is completed. In the second clock cycle, the first stage of the second instruction and the second stage of the first instruction are completed. In the third clock cycle, the first stage of the third instruction, the second stage of the second instruction, and the third stage of the first instruction are completed. The latency


    Figure 2.1: MIPS Pipeline block diagram

    Table 2: Pipelined Instruction

                   Clock 1  Clock 2  Clock 3  Clock 4  Clock 5  Clock 6  Clock 7  Clock 8  Clock 9  Clock 10
    Instruction 1  Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
    Instruction 2           Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
    Instruction 3                    Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
    Instruction 4                             Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
    Instruction 5                                      Stage 1  Stage 2  Stage 3  Stage 4  Stage 5
    Instruction 6                                               Stage 1  Stage 2  Stage 3  Stage 4  Stage 5

    remains the same (it in fact increases in a real-world scenario, as the slowest stage sets the clock for all five stages), but the throughput increases. In ten clock cycles, a non-pipelined processor can complete two instructions, whereas a pipelined processor can complete six.
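
    These counts follow from a simple cycle count. As a sketch, assume k equal-length stages and n independent instructions with no stalls; the symbols k, n, and T are introduced here for illustration and do not appear in the report:

    ```latex
    \begin{align*}
    T_{\text{non-pipelined}}(n) &= n \cdot k \\
    T_{\text{pipelined}}(n)     &= k + (n - 1) \\
    k = 5:\quad T_{\text{non-pipelined}}(2) &= 2 \cdot 5 = 10, \qquad
                T_{\text{pipelined}}(6) = 5 + (6 - 1) = 10
    \end{align*}
    ```

    Under these assumptions, ten cycles are enough for two instructions without pipelining and for six with it, which matches Table 2.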

    3. Pipelining

    The five stages of the MIPS pipeline are incorporated into this processor. The processor block diagram is given in Figure 3.1. There are 8 main modules present: the five pipeline stage modules, an instruction memory, a stack, and registers.

    Following are the descriptions of each module and its functionality.

    Instruction Fetch This module contains a register with the current PC address. This address is incremented each clock cycle. The module sends the address to the instruction memory unit and receives the 32-bit instruction. This instruction is sent to the Instruction Decode unit. The current address is also sent. In the case of a branch instruction, the address can be stored in the stack. Note also the inputs from the Execute unit. The Jump Enable is a flag that overwrites the existing PC address with the next address. It is asynchronous and once it occurs, the Instruction Fetch unit sends the new instructions to the Instruction Decode unit.

    Figure 3.1: Pipelined processor block diagram

    Instruction Memory This memory stores 32-bit instructions. As the PC address is 8 bits long, the memory can hold 2^8 = 256 instructions (addresses 0 through 255). The memory unit is combinational and selects the memory contents based on the input.

    Instruction Decode This unit controls the execute unit by providing the correct inputs based on the received instruction. There are three instruction types: Arithmetic instructions, Branch/Jump instructions, and Load/Store instructions. Each has a different instruction format (Figure 3.2).

    Figure 3.2: Format for arithmetic, branching, and data memory instructions

    For arithmetic operations involving two registers, data is retrieved from the register unit with the addresses provided in the instruction. For arithmetic instructions involving a register and an immediate/literal, only one value is retrieved from the registers. The other value is already present in the instruction. For Branch instructions, the unit sets the Stack flag, informing of the branch type (either branch with return, unconditional jump, or return to earlier instruction). For Load/Store instructions, the memory address contained in the third section of the instruction (Figure 3.2) must not be modified. Further, the memory read/write flag is set. The ALU opcode is also determined.

    Table 3 shows the ISA for this processor as well as derived values. For each instruction, the Decode unit decides which operation to trigger in the Execute stage. There are five operation flags: aluOP for the ALU, readOP and writeOP for the memory, and jumpOP and branchOP for the stack (jump and branch instructions).

    Table 3: Pipelined Instruction

    Opcode  Type        Instr.  Description                   Flag values (alu, read, write, branch, jump)
    0       NOP         NOP     nop                           0 0 0 0 0
    1       Arithmetic  add     addition                      1
    2       Arithmetic  addi    literal addition              1
    3       Arithmetic  sub     subtraction                   1
    4       Arithmetic  subi    literal subtraction           1
    5       Arithmetic  mult    multiplication                1
    6       Arithmetic  power   power of 2                    1
    7       Arithmetic  slt     shift left                    1
    8       Arithmetic  slti    shift left by literal         1
    9       Arithmetic  srt     shift right                   1
    10      Arithmetic  srti    shift right by literal        1
    11      Arithmetic  and     bitwise and                   1
    12      Arithmetic  andi    literal bitwise and           1
    13      Arithmetic  or      bitwise or                    1
    14      Arithmetic  ori     literal bitwise or            1
    15      Arithmetic  not     bitwise not                   1
    16      Arithmetic  nor     bitwise nor                   1
    17      Arithmetic  nori    literal bitwise nor           1
    18      Arithmetic  xor     bitwise xor                   1
    19      Arithmetic  xori    literal bitwise xor           1
    20      Arithmetic  xnor    bitwise xnor                  1
    21      Arithmetic  xnori   literal bitwise xnor          1
    22      Arithmetic  nand    bitwise nand                  1
    23      Arithmetic  nandi   literal bitwise nand          1
    24      Branch      jump    jump to address               0 1
    25      Branch      beq     branch (r1 = r2) (push)       1 1
    26      Branch      bgt     branch (r1 > r2) (push)       1 1
    27      Memory      load    load from mem to reg          1
    28      Memory      loadi   load # into reg               1
    29      Memory      store   store reg to mem              1
    30      Memory      storei  store # in mem                1
    31      Branch      ret     pop from stack                1

    Registers The Register unit returns memory contents upon a query. It receives queries from the Instruction Decode unit and the Write Back unit. For the Write Back queries, the register contents are modified.

    Execute This unit contains the ALU and some extraneous blocks. It performs the arithmetic operations for arithmetic instructions and comparison operations for the branch instructions. The execute unit is also the liaison for the Load/Store instructions; it passes on the data received from the Decode unit without adjusting them.


    3.1.2 Arithmetic Operations

    There are four basic arithmetic operations in any ALU: addition, subtraction, multiplication, and division. There are of course more complex operations such as exponentials, logarithms, and roots. These are implemented through one of the following methods, in order of most expensive to least expensive (thus fastest to slowest):

    1. Dedicated hardware to compute operation results. This hardware may involve lookup tables and other hardware-specific optimizations. Often the calculation is completed in a single clock cycle.

    2. Pipelined dedicated hardware. While only slightly slower than dedicated single-cycle hardware, it is a cheaper and practical alternative.

    3. Software approach using extant operations. In this method, there is no dedicated hardware and programmers must develop algorithms to calculate the operation.

    For this ALU, addition, subtraction, and multiplication are covered. Division is omitted due to its inherent complexity.

    Binary addition follows decimal addition, except with only two possible digits. There may be overflow bits, but these are not important for the ALU in this project, though commercial ALUs do contain an extra output bit for carries or overflows.

    \[
    \begin{array}{r}
      00011011 \\
    + 00100111 \\
    \hline
      01000010
    \end{array}
    \tag{3.1}
    \]
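
    The extra carry bit mentioned for commercial ALUs can be exposed with very little logic. The following is a minimal sketch under that assumption; the module name AdderWithCarry and its ports are hypothetical, and the report's own ALU discards this bit:

    ```verilog
    // Sketch: 8-bit adder that also exposes the carry-out bit.
    // Names are illustrative assumptions, not taken from the report's ALU.
    module AdderWithCarry (
        input  wire [7:0] a,
        input  wire [7:0] b,
        output wire [7:0] sum,
        output wire       carryOut
    );
        // Zero-extend both operands to 9 bits so bit 8 of the result is the carry.
        assign {carryOut, sum} = {1'b0, a} + {1'b0, b};
    endmodule
    ```

    With the operands of (3.1), a = 8'b00011011 and b = 8'b00100111, sum is 8'b01000010 and carryOut is 0.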

    The simplest method to perform binary subtraction is through 2's complement. The steps with 2's complement are as follows, with the 2's complement steps boldfaced:

    \[
    A - B = A + \mathbf{(\overline{B} + 1)}
    \tag{3.2}
    \]

    So, computing 00010101 minus 00001111 ($21_{10} - 15_{10}$) requires the 2's complement of the second operand. 00001111 inverted is 11110000. Adding 1 yields 11110001. Finally, addition yields:

    \[
    \begin{array}{r}
      00010101 \\
    + 11110001 \\
    \hline
      \mathbf{1}00000110
    \end{array}
    \tag{3.3}
    \]

    The boldfaced bit is not required. The result is 00000110, which is $6_{10}$. So subtraction can be performed with an adder.
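
    The same steps can be sketched in Verilog. This is an illustrative module, not the report's ALU code; the name SubtractViaAdder and its ports are assumptions:

    ```verilog
    // Sketch: 8-bit subtraction built from an adder via 2's complement.
    // Module and port names are illustrative assumptions.
    module SubtractViaAdder (
        input  wire [7:0] a,
        input  wire [7:0] b,
        output wire [7:0] difference
    );
        // Invert b and add 1 to form its 2's complement.
        wire [7:0] bNegated = ~b + 8'd1;
        // The 9-bit sum holds the carry-out in bit 8; that bit is not required.
        wire [8:0] sum = {1'b0, a} + {1'b0, bNegated};
        assign difference = sum[7:0];
    endmodule
    ```

    With a = 8'b00010101 and b = 8'b00001111, bNegated is 8'b11110001 and difference is 8'b00000110, as in (3.3).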

    Binary multiplication takes a similar form to decimal multiplication. There are, of course, various optimized methods; however, the approach used for the ALU is a shift-add multiplier that uses the algorithm given in Section 3.3 of [1]. In short, for each HIGH bit of the multiplicand, the multiplier is shifted by the place of the multiplicand bit and added to the product.

    For this ALU, the multiplier and multiplicand are both 8 bits. The result is a 16-bit output truncated to 8 bits to maintain a standardized output across all operations. This multiplier operates by shifting for each bit and adding to the product if the nth bit is HIGH.
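
    The shift-add scheme can be sketched as a combinational loop. This is a minimal illustration under the stated 8-bit and truncation assumptions, not the report's aluMain implementation; the module name ShiftAddMultiplier and its signals are hypothetical:

    ```verilog
    // Sketch of a combinational shift-add multiplier with 8-bit operands and
    // a product truncated to its low 8 bits. Names are illustrative assumptions.
    module ShiftAddMultiplier (
        input  wire [7:0] multiplicand,
        input  wire [7:0] multiplier,
        output reg  [7:0] product
    );
        reg [15:0] accumulator;  // full 16-bit product before truncation
        integer n;

        always @(*) begin
            accumulator = 16'd0;
            for (n = 0; n < 8; n = n + 1) begin
                // For each HIGH bit n of the multiplicand, add the multiplier
                // shifted left by n places.
                if (multiplicand[n])
                    accumulator = accumulator + ({8'd0, multiplier} << n);
            end
            product = accumulator[7:0];  // truncate to the 8-bit ALU width
        end
    endmodule
    ```

    Because the upper byte is dropped, any product of 256 or more wraps around; a design that needs the full result would expose all 16 bits of the accumulator.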

    Division Binary division is more difficult and complex than binary multiplication. In addition to the quotient, there also needs to be additional hardware for the remainder. The algorithm is similar to the multiplier; the differences are: (i) division also involves left shifts when the subtraction prediction is incorrect, and (ii) the quotient and remainder are in the same register, but in different locations, i.e. the remainder is in HI and the quotient is in LO.

    4. Verilog Code for Processor Units

    In this section, the Verilog code for each component will be elaborated.


    4.1 Instruction Fetch

    The Verilog code for the Instruction Fetch module is provided in Listing 1.

    Listing 1: Verilog code for Instruction Fetch

     1  // Fetch the instruction from Instruction Memory based on current address
     2  module InstructionFetch(clk, pcIns, insReceive, insOut, pcOut, jIn, nextPc);
     3  // nextPc is the next instruction address from execute; it contains a value when
     4  // a jump instruction is in the pipeline. It is read when the jIn jump flag is set
     5  input [7:0] nextPc;
     6  input [31:0] insReceive;
     7  input jIn, clk;
     8  // Outputs are initialized to 0 and are the insOut (instruction),
     9  // pcIns (address for instruction memory), and pcOut (current address)
    10  output reg [31:0] insOut = 0;
    11  output reg [7:0] pcIns = 0;
    12  output reg [7:0] pcOut = 0;
    13  // internal memory for current address; incremented each clock
    14  reg [7:0] pcNow;
    15  // initialize to 0
    16  initial begin
    17    pcNow = 0;
    18  end
    19
    20  // at clock change or when the instruction is received from memory or
    21  // when the current address increments
    22  always @(insReceive, pcNow, clk) begin
    23    // output the received instruction and the current instruction address
    24    insOut <= insReceive;
    25    pcOut <= pcNow;
    26  end
    27
    28  // at positive edge of clock or jump flag
    29  always @(posedge clk, posedge jIn) begin
    30    // if flag is set then reset the current address in internal
    31    // memory; this triggers the instr. mem to send a new instruction back
    32    if (jIn) begin
    33      pcNow <= nextPc;
    34      pcIns <= nextPc;
    35    end
    36    // if flag not set then the current address is incremented
    37    else begin
    38      pcNow <= pcNow+1;
    39      pcIns <= pcNow+1;
    40    end
    41  end
    42
    43  endmodule

    The IF module sends the current address to the instruction memory and when it receives the instruction, it sends this instruction out to the Instruction Decode unit (Line 24). It increments its internal address and stores it across clock cycles (Line 38). However, when a jump flag comes with a new address, the internal address is replaced (Line 33).

    4.2 Instruction Memory

    The Verilog code for the Instruction Memory module (titled ProgramCounter) is provided in Listing 2.


    Listing 2: Verilog code for Instruction Memory

     1  // Program Counter contains the instructions
     2  module ProgramCounter(pcIn, InsOut);
     3  // inputs include the received address
     4  // output is the instruction
     5  input wire [7:0] pcIn;
     6  output [31:0] InsOut;
     7  // this is the internal memory
     8  // it is a 32 bit, 256 element array
     9  reg [31:0] pcMem [0:255];
    10  // counter for memory initialization
    11  integer i;
    12  // load the memory with blanks, and then add in the instructions
    13  initial begin
    14    for (i = 0; i < 256; i = i + 1) begin // 256 does not fit in an 8-bit literal
    15      pcMem[i] = 32'h00000000;
    16    end
    17
    18    /*
    19    Programs go here:
    20    pcMem[0] = 32'h1c010502;
    21    pcMem[1] = 32'h1b343001;
    22    pcMem[2] = 32'h01030403;
    23    */
    24  end
    25  // assign statement is combinational; output is continuous
    26  assign InsOut = pcMem[pcIn];
    27  endmodule

    The Memory contains 256 elements, indexed 0 through 255 (Line 9). The unit initializes with zeros for all instructions, which in effect are NOPs. The actual instructions are then added in lines 19-22. The instructions are in 32-bit hex format, as per the assembly code specifications. Finally, the input is constantly evaluated and so the output is combinational.

    4.3 Instruction Decode

    The Verilog code for the Instruction Decode module is provided in Listing 3.

    Listing 3: Verilog code for Instruction Decode

     1  // Decoder sets flags for the Execute and the ALU opcode
     2  module InstructionDecode(clk, insIn, pcIn, pcOut, insOut, opcode, r2Out, r1Out,
     3      rDestOut, r1RegGet, r1Received, r2RegGet, r2Received,
     4      aluOp, readOp, branchOp, jumpOp, writeOp, wb, aluOpcode, stackOp);
     5  // Inputs are the instruction from the Fetch unit
     6  input clk;
     7  input [31:0] insIn;
     8  input [7:0] pcIn;
     9  // Outputs are the flags for the execute stage as well
    10  // as register addresses for register retrieve
    11  // and the data retrieved from the register itself
    12  output reg [7:0] pcOut, opcode, r1Out, r2Out, rDestOut, r1RegGet, r2RegGet;
    13  input [7:0] r1Received, r2Received;
    14  output reg aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp;
    15  output reg [3:0] aluOpcode;
    16  output reg [31:0] insOut;
    17  // temporary memory units to store the literal value
    18  // and the opcode
    19  reg [7:0] r1Literal, r2Literal, opcodeReg;
    20
    21  always @(posedge clk) begin
    22    // set up the different variables for output, i.e.
    23    // identify the literal values, the addresses
    24    // and the opcode
    25    insOut <= insIn;
    26
    27    r1RegGet <= insIn[15:8];
    28    r2RegGet <= insIn[7:0];
    29
    30    r1Literal <= insIn[15:8];
    31    r2Literal <= insIn[7:0];
    32
    33    rDestOut <= insIn[23:16];
    34    pcOut <= pcIn;
    35
    36    // Determine the opcode
    37    opcodeReg <= insIn[31:24];
    38    opcode <= opcodeReg;
    39  end
    40
    41  // Whenever these temp regs change, then set the flags
    42  // for the execute stage
    43  always @(opcodeReg or r1Received or r2Received or pcIn) begin
    44    stackOp <= 0;
    45    case (opcodeReg)
    46      8'd0: begin // NOP
    47        /* FLAG SET */
    48      end
    49      8'd1: begin // Add
    50        /* FLAG SET */
    51      end
    52      /* ... */
    53    endcase
    54  end
    55  endmodule

    The full Instruction Decode code is provided in Appendix A due to its length; a prototypical version is given in Listing 3. The Decoder first separates the instruction into its component parts: the opcode (Line 37), the register addresses or memory addresses (Lines 27-31), and the destination register or the next instruction address, depending upon the instruction opcode (Line 33).

    From Line 43, the flags (aluOP, writeOP, readOP, jumpOP, and branchOP) are set according to Table 3.

    4.4 Registers

    The Verilog code for the Register module is provided in Listing 4.

    Listing 4: Verilog code for Registers

     1  // This module contains the register memory for the processor
     2  module Registers(r1Read, r2Read, wbRead, r1Send, r2Send, wbLiteral, wbFlag);
     3  // There are five inputs and two outputs
     4  // The requested r1 and r2 addresses from the instruction
     5  // The write back address and flag, and the value to be stored there
     6  input [7:0] r1Read, r2Read, wbRead, wbLiteral;
     7  input wbFlag;
     8  output reg [7:0] r1Send, r2Send;
     9  // the register internal memory
    10  reg [7:0] regMem [0:255];
    11  // When any input changes, then set the outputs to read the register
    12  // in the case of a write back request (Line 16), write into register
    13  always @(*) begin
    14    r1Send = regMem[r1Read];
    15    r2Send = regMem[r2Read];
    16    if (wbFlag) begin
    17      regMem[wbRead] = wbLiteral;
    18    end
    19  end
    20  endmodule

    The registers return the values requested by the instruction decoder; the decoder can then decide whether these values are actually necessary. They are necessary for arithmetic or branch instructions, which operate on register values, but not for return or jump. For a write back request, i.e. a query from the Write Back module, the module writes into the registers (Line 17).

    4.5 Execute

    The Verilog code for the Execute unit is provided in Listing 5.

    Listing 5: Verilog code for Execute Stage

     1  // Prototypical Execute module; runs the CPU
     2  module Execute(clk, aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp, aluOpcode,
     3      insIn, pcIn, opcode, r1Val, r2Val, rDest, r3Dest, result, jumpOut,
     4      pcNext, rwPass, stack, memAccess, wbOut, pcNow);
     5  // inputs are the flags from the Instruction decode as well as the
     6  // register values
     7  input clk;
     8  input aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp;
     9  input [3:0] aluOpcode;
    10  input [31:0] insIn;
    11  input [7:0] pcIn, opcode, r1Val, r2Val, rDest;
    12  // Outputs are the result and a few flags:
    13  output reg [7:0] r3Dest, result;
    14  // These flags are for jumping; sent to fetch unit
    15  // rwPass is for memory unit
    16  // memAccess is memory address for load/store
    17  output reg jumpOut, wbOut;
    18  output reg [7:0] pcNext, pcNow;
    19  output reg [1:0] rwPass, stack;
    20  output reg [7:0] memAccess;
    21  wire aluDone;
    22  wire [7:0] aluResult;
    23  // ALU is instantiated
    24  aluMain alu1(.in1(r1Val), .in2(r2Val), .out(aluResult), .done(aluDone),
    25      .selector(aluOpcode));
    26
    27  always @(posedge clk) begin
    28    // Default
    29    r3Dest <= rDest; // Destination register
    30    result <= 0; // Result of execute (for wb, load, store)
    31    jumpOut <= 0; // jump flag from execute
    32    pcNext <= 0; // next PC for jumping
    33    rwPass <= 0; // whether read, write, or passby
    34    memAccess <= 0; // Address for memory access
    35    stack <= 0; // STACK NOP
    36    wbOut <= wb;
    37    pcNow <= pcIn;
    38    if (wb) begin
    39      if (readOp) begin
    40        // Load from memory into destination register
    41        memAccess <= r1Val; // Memory address
    42        rwPass <= 2'b01; // Read flag
    43      end
    44      else if (aluOp) begin
    45        // all alu instructions: result is sent to register for writing
    46        result <= aluResult; // Result of add ... and
    47        rwPass <= 2'b00; // pass flag
    48      end
    49      else begin
    50        // load immediate
    51        result <= r2Val; // take immediate value for rDest
    52        rwPass <= 2'b00; // pass flag
    53      end
    54    end
    55    else begin
    56      if (jumpOp) begin
    57        if (stackOp) begin
    58          // unconditional jump: next PC enabled, and jump enabled
    59          pcNext <= rDest; // get the address of next pc value
    60          jumpOut <= 1; // enable the jump flag for IF
    61        end
    62        else begin
    63          // return to previous value in stack; pass on last stack val
    64          jumpOut <= 1; // set jump flag for IF
    65          stack <= 2'b01; // pop stack val and send to IF
    66        end
    67      end
    68      else if (writeOp) begin
    69        // store or store immediate: from reg, lit into memory
    70        rwPass <= 2'b10; // Write flag
    71        result <= r2Val; // Store from this, either imme or reg
    72        memAccess <= r1Val; // Memory address
    73      end
    74      else if (branchOp) begin
    75        // either branch on equal, greater than; store address in stack
    76        if (aluResult == 1) begin
    77          jumpOut <= 1;
    78          pcNext

     1  Load MEM {1,2,...,8} -> REG {1,2,...,8}
     2
     3  Set Counter = 0
     4
     5  SwapCheck:
     6
     7  If Reg1 > Reg2
     8    Goto swap(Reg1, Reg2)
     9  If Reg2 > Reg3
    10    Goto swap(Reg2, Reg3)
    11  .
    12  .
    13  .
    14  If Reg7 > Reg8
    15    Goto swap(Reg7, Reg8)
    16
    17  If Counter = 0
    18    Goto Final
    19  Counter = 0;
    20  Goto SwapCheck
    21
    22
    23  Swap:
    24  Reg10 = Reg1
    25  Reg1 = Reg2
    26  Reg2 = Reg10
    27  Counter = Counter + 1
    28  Return
    29
    30  Final:
    31  Store REG {1,2,...,8} -> MEM {1,2,...,8}
    32  End

    Line 1 in the pseudocode involves loading the set of numbers in the memory into registers. This allowsfaster access for swap operations. The Counter variable at Line 3 is a detector for the sort - during eachiteration, for each swap in the bubble sort, the counter increments. So, in the last iteration, when the sortingis complete, the Counter variable remains 0 and indicates that sorting is complete. From Lines 5-20 is theSwapCheck procedure. For each two registers, SwapCheck determines whether they need to be swapped toobtain the correct order. If swapping is necessary, then the Swap procedure at Lines 23-28 is called.

    In the Swap procedure, the two registers are swapped and the Counter is incremented to indicate that a swap occurred. The Swap procedure then returns to the SwapCheck procedure.

    Back in the SwapCheck procedure, at Line 17, the Counter variable is checked. If it is 0, meaning sorting is complete, the Final procedure is called. If Counter is not 0, it is reset at Line 19 and the sorting pass begins again.

    Once Counter is 0, the Final procedure at Lines 30-32 stores the sorted list back in memory. This is shown in Figure 5.3 and Figure 5.4. In the first, the unordered set is the worst-case scenario:

    90, 80, 70, 60, 50, 40, 30, 20    (5.2)

    In the second figure (Figure 5.4), the unordered set is more realistic:

    133435621126590 (5.3)

    As can be seen, the unordered set has been sorted and written back to memory in both cases. Note that the sorting in Figure 5.4 completed faster than the worst-case sorting, as expected. While Bubble Sort is inefficient, this program demonstrates the capabilities of the processor.
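
    The SwapCheck / Swap / Final flow described above can also be checked with a small simulation-only model. The sketch below is not the program that runs on the processor; it is a behavioral Verilog stand-in for the pseudocode, and every name in it (bubble_sort_model, r, tmp, passes) is a hypothetical choice made for illustration. It loads the worst-case set (5.2), repeats the compare-and-swap pass until a pass finishes with Counter equal to 0, and prints the sorted registers.

```verilog
// Simulation-only behavioral model of the sorting pseudocode.
// Assumption: all identifiers here are illustrative and do not come
// from the processor's source code.
module bubble_sort_model;
    reg [7:0] r [1:8];   // stands in for Reg1..Reg8 (8-bit data)
    reg [7:0] tmp;       // stands in for Reg10, the scratch register
    integer counter;     // number of swaps seen in the current pass
    integer passes;      // number of SwapCheck passes executed
    integer i;

    initial begin
        // "Load": the worst-case set (5.2), in descending order
        r[1] = 8'd90; r[2] = 8'd80; r[3] = 8'd70; r[4] = 8'd60;
        r[5] = 8'd50; r[6] = 8'd40; r[7] = 8'd30; r[8] = 8'd20;

        passes  = 0;
        counter = 1;                  // non-zero so the first pass runs
        while (counter != 0) begin    // Counter == 0 means "Goto Final"
            counter = 0;              // reset before each pass
            for (i = 1; i < 8; i = i + 1) begin
                if (r[i] > r[i+1]) begin   // SwapCheck on adjacent registers
                    tmp    = r[i];         // Swap through the scratch register
                    r[i]   = r[i+1];
                    r[i+1] = tmp;
                    counter = counter + 1;
                end
            end
            passes = passes + 1;
        end

        // "Final": report the sorted registers and the pass count
        for (i = 1; i <= 8; i = i + 1)
            $display("Reg%0d = %0d", i, r[i]);
        $display("passes = %0d", passes);
        $finish;
    end
endmodule
```

    For the descending input, this model needs eight passes: seven passes that each perform at least one swap, followed by one pass with no swaps that ends the loop. An input that is already partly ordered finishes in fewer passes, which is consistent with the faster completion noted for Figure 5.4. The cycle counts of the actual processor depend on the program's instruction sequence and NOP placement, which this model does not represent.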


    Figure 5.3: Sorting: Worst Case

    Figure 5.4: Sorting: Realistic

    6. Conclusions

    In this project, a MIPS-based pipelined processor was built. The processor's pipelined architecture was demonstrated with three programs: one performing arithmetic operations, one performing memory access operations, and one comprehensive sorting program that incorporates these as well as branching.

    The designed processor accepts 8-bit data. The Verilog implementation includes registers and data memory, but no cache. The Instruction Memory is kept separate from Data Memory to prevent structural hazards. Data and control hazards are avoided with proper placement of NOP commands.

    The processor is an arithmetic and logic based processor with 23 such commands. There are 4 memory access commands and 4 branching commands, as well as the NOP command, bringing the total to 32 commands.

    With regard to improvement, the instruction set can be modified to place more focus on branching and memory commands by reducing the number of logic operations. Since logic operations can be implemented in software, they do not need as many dedicated commands. Further, a cache can be implemented in place of the data memory, with the memory as a completely separate element.

    Of course, a physical processor can only truly be built using VLSI. It is possible to take this Verilog implementation and convert it to a VLSI format that can be etched.


    7. Appendix A

    7.1 Instruction Decode

    The code for the instruction decode is provided. Note the case statement that performs the actual decoding. The values of the flags are set based on Table 3.

    Listing 13: Verilog Code for Instruction Decode

    //
    module InstructionDecode(clk, insIn, pcIn, pcOut, insOut, opcode,
            r2Out, r1Out, rDestOut, r1RegGet, r1Received, r2RegGet, r2Received,
            aluOp, readOp, branchOp, jumpOp, writeOp, wb, aluOpcode, stackOp);
        input clk;
        input [31:0] insIn;
        input [7:0] pcIn;
        output reg [7:0] pcOut, opcode, r1Out, r2Out, rDestOut, r1RegGet, r2RegGet;
        input [7:0] r1Received, r2Received;
        output reg aluOp, readOp, branchOp, jumpOp, writeOp, wb, stackOp;
        output reg [3:0] aluOpcode;
        output reg [31:0] insOut;

        reg [7:0] r1Literal, r2Literal, opcodeReg;

        always @(posedge clk) begin
            insOut <= insIn;

            r1RegGet <= insIn[15:8];
            r2RegGet <= insIn[7:0];

            r1Literal <= insIn[15:8];
            r2Literal <= insIn[7:0];

            rDestOut <= insIn[23:16];
            pcOut <= pcIn;
            // insOut = insIn;

            opcodeReg <= insIn[31:24];
            opcode <= opcodeReg;

        end

        always @(opcodeReg or r1Received or r2Received or pcIn) begin
            stackOp <= 0;
            case (opcodeReg)
                8'd0: begin // NOP
                    r1Out <= 0;
                    r2Out <= 0;
                    aluOp <= 0;
                    readOp <= 0;
                    branchOp <= 0;
                    jumpOp <= 0;
                    writeOp <= 0;
                    wb <= 0;
                    aluOpcode <= 0;
                end
                8'd1: begin // Add
                    r1Out <= r1Received;
                    r2Out <= r2Received;
                    aluOp <= 1;
                    readOp <= 0;
                    branchOp <= 0;
                    jumpOp <= 0;
                    writeOp <= 0;
                    wb <= 1;
                    aluOpcode <= 0;
                end
                8'd2: begin // Add immediate
                    r1Out <= r1Received;
                    r2Out <= r2Literal;
                    aluOp <= 1;
                    readOp <= 0;
                    branchOp <= 0;
                    jumpOp <= 0;
                    writeOp <= 0;
                    wb <= 1;
                    aluOpcode <= 0;
                end
                8'd3: begin // Subtract
                    r1Out <= r1Received;
                    r2Out <= r2Received;
                    aluOp <= 1;
                    readOp <= 0;
                    branchOp <= 0;
                    jumpOp <= 0;
                    writeOp <= 0;
                    wb <= 1;
                    aluOpcode <= 1;
                end
                8'd4: begin // Subtract immediate
                    r1Out <= r1Received;
                    r2Out <= r2Literal;
                    aluOp <= 1;
                    readOp <= 0;
                    branchOp <= 0;
                    jumpOp <= 0;
                    writeOp <= 0;
                    wb <= 1;
                    aluOpcode <= 1;
                end
                8'd5: begin // Multiply
                    r1Out <= r1Received;
                    r2Out <= r2Received;
                    aluOp