16-Bit RISC Processor Design for Convolution Application Using Verilog HDL

CHAPTER 1
INTRODUCTION

THE RISC-16 PROCESSOR

The RiSC-16, for Ridiculously Simple Computer, was developed by Prof. Bruce Jacob at the University of Maryland for educational purposes. There are two implementations of this architecture, a sequential one and a pipelined one. In this paper we give only a short description of the sequential implementation. For more information about the RiSC-16, the reader is invited to refer to three documents: [1] for the instruction set, [2] for the sequential implementation and [3] for the pipelined implementation.

The RiSC-16 is a RISC processor based on the Harvard architecture. As its name indicates, it is a 16-bit processor. All data and instructions are two bytes wide, so all registers and both memories use a 16-bit word format. It is made up of:

- one bank of eight registers, addressable with three bits. Register 0 is read-only and contains the value zero, which is quite common among RISC processors.
- separate instruction and data memories. Both are addressable with sixteen bits, and hence each has a capacity of 64 Kwords.
- one Arithmetic-Logic Unit (ALU) that can execute three operations: addition, bitwise NAND and test of equality.
- multiplexers to choose between buses.
- one control unit. Its functions are to decode the opcodes and to control the ALU, the multiplexers and the write function into the register bank and into data memory.
- a program counter (PC) and its incrementer.
- an instruction register containing the instruction that is being executed.
- an adder to compute jump addresses.
- two sign-extension logic blocks to convert the 7-bit immediate values into the 16-bit format.
- one left-shift logic block to convert the 10-bit immediate values into the 16-bit format.
- several buses to convey data between elements.
- control signals routed to the different blocks (for example, to choose the input bus of a multiplexer).

Refer to Figure 1 to see how these are connected.
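The two immediate conversions and the read-only register 0 described above can be illustrated with a short Python sketch. It is not taken from the RiSC-16 documents: the function names are hypothetical, and the shift distance of 6 bits is an assumption that follows from placing a 10-bit value in the upper bits of a 16-bit word.

```python
MASK16 = 0xFFFF  # all values are kept to 16 bits


def sign_extend_7(imm7):
    """Sign-extend a 7-bit immediate to the 16-bit format."""
    imm7 &= 0x7F
    if imm7 & 0x40:                 # bit 6 is the sign bit of a 7-bit value
        return (imm7 | 0xFF80) & MASK16
    return imm7


def shift_imm10(imm10):
    """Left-shift a 10-bit immediate into the upper bits of a 16-bit word.

    Assumption: the shift distance is 6, so the low 6 bits become zero.
    """
    return ((imm10 & 0x3FF) << 6) & MASK16


def read_register(regs, index):
    """Read a register from the bank of eight; register 0 always reads as zero."""
    return 0 if index == 0 else regs[index] & MASK16


print(hex(sign_extend_7(0x7F)))    # 7-bit value with the sign bit set
print(hex(shift_imm10(0x3FF)))     # largest 10-bit immediate
```

Keeping the conversions as small pure functions mirrors the way the hardware treats them as fixed combinational blocks with no state of their own.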

The instruction set consists of 8 instructions. Table I shows their assembler format and describes their operation.

This processor illustrates the RISC philosophy pushed to its maximum of simplicity. The instructions are elementary, yet they are powerful enough to solve complex problems, and no instruction can be replaced by a combination of the others.

Students are quickly able to master this reduced set of 8 instructions and to write small programs. A second strong point of the RiSC-16 is its small number of internal elements, which makes it possible to display all blocks clearly on the screen. Furthermore, both the sequential and the pipelined versions were implemented on an FPGA.

CHAPTER 2
RISC (Reduced Instruction Set Computer)

An Introduction

The Reduced Instruction Set Computer, or RISC, is a microprocessor CPU design philosophy that favors a smaller and simpler set of instructions that all take about the same amount of time to execute. The most common RISC microprocessors are ARM, DEC Alpha, PA-RISC, SPARC, MIPS, and IBM's PowerPC.

The idea was inspired by the discovery that many of the features included in traditional CPU designs to facilitate coding were being ignored by the programs running on them. These more complex features also took several processor cycles to perform. Additionally, the performance gap between the processor and main memory was increasing. This led to a number of techniques to streamline processing within the CPU while at the same time attempting to reduce the total number of memory accesses.

When controller design became more complex in CISC and performance was still not up to expectations, people started looking for alternatives. It had been found that speed suffers whenever the processor talks to memory. So one improvement in CPI (cycles per instruction) was to keep the instruction set very simple: simple not in the way it works, but in the way it looks. That is why a typical RISC architecture has very few instructions, and the processor requests data from memory only through Load and Store; other memory addressing modes are avoided. The complexity of controller design has been overcome by fixing the positions of the operand and opcode bits in the instruction register. Finally, pipelining added a new dimension of speed with the help of only a few additional registers. Pipelining increases throughput by reducing the effective CPI, so that an instruction can effectively be completed every clock cycle. Pipelining, in any kind of architecture, was born from the inherent parallelism and the idle states of components.

The pipelined architecture can be further enhanced with the concept known as superscalar execution, in which more than one execution unit is provided. While one unit is busy with the current execution task, the fetch unit can fetch the next instruction, which is then executed by another execution unit present in the system.

Features which are generally found in RISC designs are:

- uniform instruction encoding (for example, the opcode is always in the same bit position in each instruction, which is always one word long), which allows faster decoding;
- a homogeneous register set, allowing any register to be used in any context and simplifying compiler design;
- simple addressing modes (complex addressing modes are replaced by sequences of simple arithmetic instructions);
- few data types supported in hardware (for example, some CISC machines had instructions for dealing with byte strings, and others had support for polynomials and complex numbers; such instructions are unlikely to be found on a RISC machine).

Over many years, RISC instruction sets have tended to grow in size. Thus, some have started using the term "load-store" to describe RISC processors, since this is the key element of all such designs. Instead of the CPU itself handling many addressing modes, a load-store architecture uses a separate unit dedicated to handling very simple forms of load and store operations. CISC processors are then termed "register-memory" or "memory-memory".

Today RISC CPUs (and microcontrollers) represent the vast majority of all CPUs in use. The RISC design technique offers power in even small sizes, and thus has come to completely dominate the market for low-power "embedded" CPUs. Embedded CPUs are by far the largest market for processors. RISC had also completely taken over the market for larger workstations for much of the 90s. After the release of the Sun SPARCstation, the other vendors rushed to compete with RISC-based solutions of their own. Even the mainframe world is now completely RISC based.

3. RISC vs CISC

3.1 CISC Designs

An overriding characteristic of CISC machines is an approach to instruction set architecture that emphasizes doing more with each instruction. As a result, CISC machines have a wide variety of addressing modes. CISC machines take a "have it your way" approach to the location and number of operands in various instructions. As a result, instructions are of widely varying length and execution time.

3.2 The bridge toward RISC (Historical factors)

The capabilities of CISC allowed more operations to be packed into the same program size. During that period, program and data storage were given more importance, since the cost of memory was high.

An attempt was made to narrow the "semantic gap", that is, the gap that existed between machine instruction sets and high-level language constructs, by adding complicated instructions and addressing modes in order to increase performance. Most of these improvements were rejected by compiler writers on the grounds that they did not fit well with the language requirements and were of only limited usefulness. At the same time, research conducted by David Patterson and Donald Knuth showed that 85% of a program's statements were assignments, conditionals or procedure calls. Nearly 80% of the assignment statements were MOVE instructions with no arithmetic operations.

As more and more capabilities were added to processors, it became increasingly difficult to support the higher clock speeds that would otherwise have been possible. Complex instructions and addressing modes worked against higher clock speeds because of the greater number of microscopic actions that had to be performed per instruction. Moreover, RAM prices dropped sufficiently that the pressure on system designers was less to design instructions that did more than it was to design systems that were faster. It was also becoming cost-effective to employ small amounts of higher-speed cache memory to reduce memory latency, i.e. the waiting time between when a memory request is made and when it has been satisfied.

3.3 Why RISC?

Various attempts have been made since the earliest days of computing to increase instruction execution rates by overlapping the execution of more than one instruction. The most common ways of overlapping are pre-fetching, pipelining and superscalar operation.

1) Pre-fetching: The process of fetching the next instruction or instructions into an event queue before the current instruction is complete is called pre-fetching. One of the earliest 16-bit microprocessors, the Intel 8086/8, pre-fetches into an on-board queue up to six bytes following the byte currently being executed, thereby making them immediately available for decoding and execution, without latency.

2) Pipelining: Pipelining instructions means starting or issuing an instruction prior to the completion of the currently executing one. The current generation of machines carries this to a considerable extent. The PowerPC 601 has 20 separate pipeline stages in which various portions of various instructions are executing simultaneously.

3) Superscalar operation: Superscalar operation refers to a processor that can issue more than one instruction simultaneously. The PPC 601 has independent integer, floating-point and branch units, each of which can be executing an instruction simultaneously.

CISC machine designers incorporated pre-fetching, pipelining and superscalar operation in their designs, but with instructions that were long and complex and operand access that depended on complex address arithmetic, it was difficult to make efficient use of these new speed-up techniques. Furthermore, complex instructions and addressing modes hold down clock speed compared to simple instructions. RISC machines were designed to efficiently exploit the caching, pre-fetching, pipelining and superscalar methods that were invented in the days of CISC machines.

4. RISC: Top level Description and guidelines

We implemented a 16-bit RISC microprocessor based on a simplified version of the MIPS architecture. The processor has 16-bit instruction words and 16 general purpose registers. Every instruction is completed in four cycles. An external clock is used as the timing mechanism for the control and datapath units. This section includes a summary of the main features of the processor, a description of the pins, a high level diagram of the external interface of the chip, and the instruction word formats.

- 16 instructions in the instruction set architecture.
- 16 general purpose registers.
- Instruction completion in 4 clock cycles.
- External clock is used.
- 14 external address lines.

Fig.4 High Level Block Diagram that describes the external interface of the chip

4.1 Instruction Set Architecture (ISA)

The ISA of this processor consists of 16 instructions with a fixed-size 4-bit operation code. The instruction words are 16 bits long. The following chart describes the instruction formats.

The processor features five instruction classes:

1. Arithmetic (Two's Complement) ALU operations (2)

ADD: Rd = Rs + Rt
Operands A and B, stored in register locations Rs and Rt, are added and written to the destination register specified by Rd.

SUB: Rd = Rs - Rt
Operand B (Rt) is subtracted from Operand A (Rs) and written to Rd.

2. Logical ALU operations (6)

AND: Rd = Rs & Rt
Operand A (Rs) is bitwise ANDed with Operand B (Rt) and written into Rd.

OR: Rd = Rs | Rt
Operand A (Rs) is bitwise ORed with Operand B (Rt) and written into Rd.

XOR: Rd = Rs ^ Rt
Operand A (Rs) is bitwise XORed with Operand B (Rt) and written into Rd.

NOT: Rd = ~Rs
Operand A (Rs) is bitwise inverted and written into Rd.

SLA: Rd = Rs >> 1
Operand A (Rs) is arithmetically shifted to the right by one bit and written into Rd. The MSB (sign bit) is preserved by this operation.

3. Memory operations (3)

LI: Rd = 8-bit Sign extended Immediate
The 8-bit immediate in the instruction word is sign-extended to 16 bits and written into the register specified by Rd.

LW: Rd = Mem[Rs]
The memory word specified by the address in register Rs is loaded into register Rd.

SW: Mem[Rs] = Rt
The data in register Rt is stored into the memory location specified by Rs.

4. Conditional Branch operations (2)

BIZ: PC = PC + 1 + Offset if Rs = 0
If all the bits in register Rs are zero, then the current Program Count (PC + 1) is offset to PC + 1 + Offset. The count is offset from PC + 1 because it is incremented and stored during the Fetch cycle.

BNZ: PC = PC + 1 + Offset if Rs != 0
If at least one bit in register Rs is nonzero, then the current Program Count (PC + 1) is offset to PC + 1 + Offset.

5. Program Count Jump operations (3)

JAL: Rd = PC + 1 and PC = PC + 1 + Offset
The Jump and Link instruction writes the current Program Count into register Rd and offsets the program count to PC + 1 + Offset.

JMP: PC = PC + 1 + Offset
The unconditional jump instruction offsets the program count to PC + 1 + Offset.

JR: PC = Rs
The Jump Return instruction sets the Program Count to the one previously stored by JAL.

FETCH INSTRUCTION

Part 1
- Retrieve instruction word from main memory
- Increment Program Counter and store in ALU Out

Part 2
- Write Incremented Program Count
- Load Operands into latches from Register File

EXECUTE INSTRUCTION

Part 1
- Perform ALU Operation based on instruction word and store in ALU Out
- Move Memory Word into MDR for Load Word operation
- Write Data into Memory from Register File for Store Word operation

Part 2
- Write ALU, IR (Immediate), or MDR data into Register File
- Write new Program Count for Jump Operation or if Branch taken

4.2 MICRO-ARCHITECTURE

The micro-architecture refers to a view of the machine that exposes the registers, buses and all other important functional units such as ALUs and counters. The principal subsystems of a processor are the CPU, main memory and the input/output. The data path and the control unit interact to do the actual processing task. The control unit receives signals from the data path and sends control signals to the data path. These signals control the data flow within the CPU and between the CPU and the main memory and input/output.
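The register-transfer descriptions of section 4.1 can be exercised with a small behavioural model in Python. This is only a sketch under stated assumptions: the binary instruction formats are not reproduced in the text, so instructions are written as already-decoded dictionaries; the keys `op`, `rd`, `rs`, `rt`, `imm` and `offset`, and the helper names `step` and `run`, are hypothetical; `offset` is taken to be an already sign-extended integer; and the shift instruction follows the right-shift description given in the text.

```python
MASK16 = 0xFFFF


def sign_extend(value, bits):
    """Sign-extend a `bits`-wide value to 16 bits."""
    value &= (1 << bits) - 1
    if value & (1 << (bits - 1)):
        value -= 1 << bits
    return value & MASK16


def step(pc, regs, mem, instr):
    """Execute one decoded instruction; return the next program count."""
    op = instr["op"]
    rd, rs, rt = instr.get("rd", 0), instr.get("rs", 0), instr.get("rt", 0)
    a, b = regs[rs], regs[rt]
    offset = instr.get("offset", 0)       # assumed already sign-extended
    next_pc = (pc + 1) & MASK16           # PC + 1 is produced during fetch

    if op == "ADD":
        regs[rd] = (a + b) & MASK16
    elif op == "SUB":
        regs[rd] = (a - b) & MASK16
    elif op == "AND":
        regs[rd] = a & b
    elif op == "OR":
        regs[rd] = a | b
    elif op == "XOR":
        regs[rd] = a ^ b
    elif op == "NOT":
        regs[rd] = ~a & MASK16
    elif op == "SLA":                     # right shift by one, sign bit preserved
        regs[rd] = (a >> 1) | (a & 0x8000)
    elif op == "LI":
        regs[rd] = sign_extend(instr["imm"], 8)
    elif op == "LW":
        regs[rd] = mem.get(a, 0)
    elif op == "SW":
        mem[a] = b
    elif op == "BIZ":
        if a == 0:
            next_pc = (next_pc + offset) & MASK16
    elif op == "BNZ":
        if a != 0:
            next_pc = (next_pc + offset) & MASK16
    elif op == "JAL":
        regs[rd] = next_pc                # link: save PC + 1
        next_pc = (next_pc + offset) & MASK16
    elif op == "JMP":
        next_pc = (next_pc + offset) & MASK16
    elif op == "JR":
        next_pc = a
    else:
        raise ValueError("unknown operation: " + op)
    return next_pc


def run(program, max_steps=1000):
    """Run until the program count leaves the program (hypothetical stop rule)."""
    regs, mem, pc = [0] * 16, {}, 0
    for _ in range(max_steps):
        if pc >= len(program):
            break
        pc = step(pc, regs, mem, program[pc])
    return regs, mem


# Sum 3 + 2 + 1 into R2 using a count-down loop.
program = [
    {"op": "LI", "rd": 1, "imm": 3},             # R1 = 3 (loop counter)
    {"op": "LI", "rd": 2, "imm": 0},             # R2 = 0 (accumulator)
    {"op": "LI", "rd": 3, "imm": 0xFF},          # R3 = -1
    {"op": "ADD", "rd": 2, "rs": 2, "rt": 1},    # R2 += R1
    {"op": "ADD", "rd": 1, "rs": 1, "rt": 3},    # R1 -= 1
    {"op": "BNZ", "rs": 1, "offset": -3},        # loop while R1 != 0
]
regs, mem = run(program)
print(regs[2])  # 6
```

The branch offset of -3 is relative to PC + 1, which is why the branch at address 5 lands on the ADD at address 3.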

Program Counter

Fig.4.2.1 Program Counter

Instruction Register and Register File

Fig.4.2.2 Instruction Register and RegFile

ALU and Operand Registers

Fig.4.2.3 ALU and Operand Registers

Control Unit Design

The Control FSM has only three distinct states that determine the operation of the processor: IDLE, FETCH and EXECUTE. The FETCH state is further divided into two phases, fetch instruction and fetch operands; the EXECUTE state is similarly divided into two phases. When the reset signal (reset_s1) goes high in any state, the FSM is placed in the IDLE state. While in the IDLE state, the control unit asserts the PC write enable signal (pc_wrt_s2 = 1) and selects zero (pc_sel_s2 = 0) as the current program count.

Fig.4.2.4 Control unit and Control signals

When the reset signal goes low, the FSM's next state will be the FETCH state, and the instruction from memory address 0 will be loaded into the Instruction Register (IR) to begin program execution. The control unit looks at the next state (FETCH) and generates the IR write (ir_wrt_s1), Operand A Select (opA_sel_s1), Operand B Select (opB_sel_s1 = 0010) and ALU add operation (alu_op_s1 = 00000001) signals to load the IR with the next instruction and increment the PC by 1. These events all occur on the first clock of the FETCH state. One-hot signals are used for alu_op_s1, opB_sel_s1, and data_sel_s2 to make decoding easier in the datapath units. The operation in the next phase of FETCH is determined by the opcode (opcode_s2) from the IR, except that the incremented PC is written in from the ALU output latch in all cases. The ALU operations load Operands A and B from the Register File. Load Word needs only Operand A, while Store Word needs both operands (one for the address and one for the data word). The branch instructions use the offset in the instruction word and the PC + 1 count as operands into the ALU. JAL stores the incremented PC in the Register File, while JR loads the return address into Operand A.

After phase two of the FETCH state, the FSM enters the EXECUTE state. During the first phase of an ALU operation, the appropriate alu_op_s1 control signals are sent to the ALU as decoded from the opcode. The operand mux control signals (opA_sel_s1 and opB_sel_s1) are also generated to select the latch outputs. For the other operations (except LI), an add operation is required from the ALU. The operands chosen for the add are determined by the operation specified. Load Word and Store Word access memory in this first phase as well. The second phase of EXECUTE writes data into the register file or writes a new address into the PC. For a branch instruction, the control unit looks at the check-zero signal from Operand A to determine whether the branch should be taken and the new PC written. The control unit then returns the next state to FETCH to repeat the process for the next instruction.
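The state sequencing described above can be sketched as a small Python model. It is an illustration only, not the Verilog control unit: the two phases of FETCH and EXECUTE are modelled as separate enum members, and the names `State`, `next_state` and `branch_taken` are hypothetical.

```python
from enum import Enum


class State(Enum):
    IDLE = 0
    FETCH_1 = 1      # fetch instruction, increment PC
    FETCH_2 = 2      # write incremented PC, load operands
    EXECUTE_1 = 3    # ALU operation or memory access
    EXECUTE_2 = 4    # write register file or new PC


# Order of phases while reset is low.
_SEQUENCE = {
    State.IDLE: State.FETCH_1,
    State.FETCH_1: State.FETCH_2,
    State.FETCH_2: State.EXECUTE_1,
    State.EXECUTE_1: State.EXECUTE_2,
    State.EXECUTE_2: State.FETCH_1,
}


def next_state(state, reset):
    """Return the next FSM state; a high reset forces IDLE from any state."""
    if reset:
        return State.IDLE
    return _SEQUENCE[state]


def branch_taken(op, operand_a):
    """Check-zero decision made in the second EXECUTE phase for BIZ and BNZ."""
    is_zero = (operand_a & 0xFFFF) == 0
    if op == "BIZ":
        return is_zero
    if op == "BNZ":
        return not is_zero
    return False


# One reset cycle followed by two full instructions.
state = next_state(State.FETCH_2, reset=True)
trace = [state]
for _ in range(8):
    state = next_state(state, reset=False)
    trace.append(state)
print([s.name for s in trace])
```

After the reset cycle the model passes through four phases per instruction, which matches the statement that every instruction completes in four clock cycles.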

Arithmetic Logic Unit

An arithmetic and logic unit (ALU) is contained within a central processing unit (CPU). The ALU is a dedicated collection of high-speed circuits that performs the arithmetic and logical operations of a computer. The ALU can be physically located adjacent to, or underneath, the processor registers. The ALU can be formed in the shape of a square grid.

The arithmetic and logic unit works in concert with a control unit, internal memory, and registers. Together, these units make up the CPU. Where the ALU performs mathematical computation, logic decisions and processing of data taken from the registers, the control unit reads program instructions, farms out processing tasks to the ALU, and ensures that the proper sequence is followed according to the program instructions.

Block Diagram

Signals from ID

We receive and use the following signals from the ID stage:

- ALUSrc: Determines whether or not the second operand of the ALU is the immediate value.
- ALUOp+func: The combined signal of the ALUOp and func fields of each instruction.
- BNE?: Indicates if an instruction is BNE.
- BEQ?: Indicates if an instruction is BEQ.
- Immediate: The sign-extended immediate field from ID.
- Jump?: Indicates if an instruction is j-type.
- JR?: Indicates if an instruction is jr.
- Linked?: Indicates if an instruction needs a return address.
- LUI?: Indicates if an instruction is LUI.
- Op1: The value of the register in the first operand field.
- Op2: The value of the register in the second operand field.
- PC+4: The location of the next instruction when the instruction first comes from IF.
- Shamt: The shift amount given to the ALU.

ALU Overview

Below is a table of instructions with their corresponding control bits:

Instruction  BEQ?  BNE?  JUMP?  LINK?  IMM?  JR?  LUI  ALUOp
ADDU         0     0     0      0      0     0    0    ADD
SUBU         0     0     0      0      0     0    0    SUB
AND          0     0     0      0      0     0    0    AND
OR           0     0     0      0      0     0    0    OR
NOR          0     0     0      0      0     0    0    NOR
SLT          0     0     0      0      0     0    0    SLT
SLTU         0     0     0      0      0     0    0    SLTU
SLL          0     0     0      0      0     0    0    SLL
SRL          0     0     0      0      0     0    0    SRL
JR           0     0     1      0      0     1    0    ~
JALR         0     0     1      1      0     0    0    ~
J            0     0     1      0      1     0    0    ~
JAL          0     0     1      1      0     0    0    ~
ADDIU        0     0     0      0      1     0    0    ADD
ANDI         0     0     0      0      1     0    0    AND
ORI          0     0     0      0      1     0    0    OR
SLTI         0     0     0      0      1     0    0    SLT
SLTIU        0     0     0      0      1     0    0    SLTU
LUI          0     0     0      0      1     0    1    LUI
LW           0     0     0      0      1     0    0    ADD
SW           0     0     0      0      1     0    0    ADD
BEQ          1     0     0      0      0     0    0    SUB
BNE          0     1     0      0      0     0    0    SUB

We designed our ALU to respond to the following OpCodes:

Op    Code       SIG0
ADD   000000001  0
SUB   000000001  1
OR    000000010  0
AND   000000100  0
NOR   000001000  0
SLT   000010000  1
SLTU  000100000  1
SLL   001000000  0
SRL   010000000  0
LUI   100000000  0
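The OpCode table above can be captured in a few lines of Python. This sketch assumes that the 10-bit OpCode values quoted in the text are the 9-bit one-hot code followed by SIG0 as the last bit; the names `ALU_OPS`, `alu_opcode` and `opcode_bits` are hypothetical.

```python
# name -> (position of the one-hot bit within the 9-bit code, SIG0)
ALU_OPS = {
    "ADD": (0, 0),
    "SUB": (0, 1),
    "OR": (1, 0),
    "AND": (2, 0),
    "NOR": (3, 0),
    "SLT": (4, 1),
    "SLTU": (5, 1),
    "SLL": (6, 0),
    "SRL": (7, 0),
    "LUI": (8, 0),
}


def alu_opcode(name):
    """Return the 10-bit OpCode: 9-bit one-hot code with SIG0 appended."""
    bit, sig0 = ALU_OPS[name]
    return ((1 << bit) << 1) | sig0


def opcode_bits(name):
    """Format the OpCode as a 10-character binary string."""
    return format(alu_opcode(name), "010b")


for name in ALU_OPS:
    print(name, opcode_bits(name))
```

Because ADD and SUB share the same one-hot bit and differ only in SIG0, both select the adder in this encoding.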

SIG0 is used to indicate sign in the adder.

ALU Design

Arithmetic

Addition

The first operation we designed the ALU to do was addition. We decided to implement this using a ripple carry adder. The advantages of a ripple carry adder are the simplicity of its logic and the ease of extending it to more bit positions. Its disadvantage is that it is slow: adders such as the carry-lookahead adder are much faster. For our design, though, we believe that the ID and MEM stages will take enough time that the speed difference between the two is not a major concern. Seeing as it is the first instruction, we gave it the OpCode 0000000010.

Our adder diagram

The last zero will be explained in the next section.

Subtraction

The next operation we designed was the awkward stepbrother of addition: subtraction. Rather than implementing a separate adder dedicated solely to subtraction, we decided to modify our existing adder to do both. Subtraction is equivalent to adding one number to the two's complement of another. That is, the second input is inverted and then a one is added to it. So, we have

A - B = A + (~B + 1)

Because it is addition, we can rearrange it to be

A - B = (A + ~B) + 1

Now it is easy to see that the subtraction of B from A is the addition of A and the inversion of B, plus 1. So, we place a multiplexer in front of the second input of the adder: one choice passes B unmodified; the other passes the inverted B. Then, to add one, we set the carry-in bit of the adder high. In the interest of saving a little bit of logic, we tied the carry-in signal of the ALU and the select input of the multiplexer to the same bit ... (insert dramatic overture here) ... the last one. So, the subtraction gets the OpCode: 0000000011

Addition and subtraction combination circuit.

From here on, we will refer to the last bit as SIG0 because it sounds ominous.
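The combined addition and subtraction circuit can be modelled bit by bit in Python. This is a sketch rather than the actual design: the 16-bit default width and the names `full_adder` and `add_sub` are assumptions. As in the description, one control bit (SIG0) both selects the inverted second input and drives the carry-in.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def add_sub(a, b, sig0, width=16):
    """Ripple-carry adder/subtractor.

    sig0 = 0 computes A + B; sig0 = 1 computes A + ~B + 1 = A - B.
    Returns (result, carry_out), with the result truncated to `width` bits.
    """
    carry = sig0                          # carry-in is tied to SIG0
    result = 0
    for i in range(width):                # the carry ripples from bit 0 upward
        a_bit = (a >> i) & 1
        b_bit = ((b >> i) & 1) ^ sig0     # multiplexer: B or inverted B
        s, carry = full_adder(a_bit, b_bit, carry)
        result |= s << i
    return result, carry


print(add_sub(5, 3, 0))   # (8, 0)
print(add_sub(5, 3, 1))   # (2, 1)
```

The XOR with SIG0 plays the role of the multiplexer and inverter pair: when SIG0 is 0 the B bit passes through unchanged, and when SIG0 is 1 it is inverted.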

Set Less Than

SLT

The purpose of the set less than operation is to determine whether the first input of the ALU is less than the second, or A
