
Projects for IC-project and Verification HT2-2010-VT2-2011

You need to select a project as a team (2 members). Please apply by email to [email protected], specifying your 1st and 2nd choice. We need to distribute the workload and therefore cannot guarantee that you will get your 1st choice. Please apply no later than Dec 2nd.

MAP Channel Decoder (Reza)

Reconfigurable UMTS filter (Deepak)

Implementation of a JPEG Accelerator (Johan)

Hardware Based Media Player (Yasser)

Surveillance System (Isael)

RISC track:

1st draft Mini-MIPS (Chenxin)

IC-Project: MAP Channel Decoder

Objective: In this project the students will design a decoder for rate 1/2 convolutional codes generated by a [7, 5] encoder. The encoder has memory two and is shown in Figure 1, so its trellis representation has 4 states. Decoding requires more complicated hardware than encoding and is done using the BCJR decoding algorithm on a tail-biting structure. The input to the decoder is the coded data generated by the encoder (implemented in MATLAB, C++ or any other programming language) and the output of the decoder is the decoded (original) message. The students only need to implement the decoder in hardware, not the encoder.

Figure 1. [7,5] convolutional encoder
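Since the encoder only has to exist in software, a minimal Python sketch of a rate 1/2 [7, 5] encoder may help when generating test data. It assumes that generator 7 (binary 111) produces the first output bit and generator 5 (binary 101) the second, and that tail-biting is obtained by preloading the two memory cells with the last two message bits. Both are assumptions, since Figure 1 is not reproduced here, and the function name `encode_75` is hypothetical.

```python
def encode_75(bits, tail_biting=True):
    """Rate-1/2 convolutional encoder, octal generators [7, 5], memory two.

    Assumption: output order is (generator 7, generator 5) for every input bit.
    """
    if tail_biting and len(bits) >= 2:
        # Preload the memory with the last two message bits so that the
        # encoder starts and ends in the same trellis state.
        s1, s2 = bits[-1], bits[-2]
    else:
        s1, s2 = 0, 0
    coded = []
    for u in bits:
        coded.append(u ^ s1 ^ s2)  # generator 7 = 111 (input, s1, s2)
        coded.append(u ^ s2)       # generator 5 = 101 (input, s2)
        s1, s2 = u, s1             # shift the register
    return coded
```

With `tail_biting=False` the encoder starts in the all-zero state, which is convenient for checking a decoder against hand-computed sequences.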

Grading:

Grade: 3

Mandatory: The decoder should work properly according to the known input and the desired output sequences during verification. A fixed block length for the codes and soft-decision decoding can be assumed here. A thorough comparison with MATLAB results is also required.

Grade: 4

Mandatory: Tasks for grade 3 + thorough optimization and verification of the design using power analysis, trying different word lengths for the inputs and calculations, as well as efficient memory utilization. Important design factors such as area and power consumption are expected to be optimized to a reasonable extent. All considerations for an excellent design and a well-prepared report are taken into account.

Reconfigurable UMTS filter:

This project deals with the implementation of a UMTS filter used in a Wideband-CDMA system. A

generic block diagram of the W-CDMA receiver is shown in Figure 1.

Figure.1: Block diagram of a w-CDMA receiver.

The filter specification has to satisfy the requirements of the 3GPP standard, out-of-band signal attenuation being one of them. In [1], the filter length needed to satisfy the 3GPP specification was found to be at least 65 taps.

The difference between the in-band power and the out-of-band power is defined as adjacent channel selectivity (ACS) and is shown in Figure 2. The out-of-band signal power can be very strong, very weak or somewhere in between. It was also shown in [1] that, by measuring the in-band and out-of-band signals, the filter can be operated at a relaxed specification to reduce power. Coefficient optimization of the filter was carried out in [2].

Figure.2: UMTS filter specification.

Figure.3: Block diagram of the optimized filter architecture.

In [3], further optimizations were carried out for ASIC implementation, such as splitting the filter into filter banks to reduce the number of clock domains and taking advantage of early saturation in the signal power measurement units.

This project aims at implementing the architecture of the UMTS filter shown in Figure 3.

Challenges:

In logic design: The architecture involves the design of an FIR and an IIR filter as modules. It also consists

of a control unit that takes appropriate decisions to vary the length of the FIR filter from a maximum of

65 taps to a minimum of 5 taps. The control unit also involves the implementation of a hardware

division unit.
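The role of the control unit can be illustrated with a small Python sketch. The mapping from measured powers to filter length below is purely hypothetical (a linear mapping of the out-of-band to in-band power ratio, clamped to the range 5 to 65 taps); the actual decision rule has to be taken from [1] and [3]. The names `select_taps` and `fir` are also assumptions. The division mirrors the task of the hardware division unit.

```python
def select_taps(in_band_power, out_of_band_power, max_taps=65, min_taps=5):
    """Pick an odd FIR length from the measured powers (hypothetical rule)."""
    if in_band_power <= 0:
        return max_taps                          # no usable measurement: be safe
    ratio = out_of_band_power / in_band_power    # job of the division unit
    ratio = min(max(ratio, 0.0), 1.0)            # clamp to [0, 1]
    taps = min_taps + int(ratio * (max_taps - min_taps))
    return taps | 1                              # keep the length odd (symmetric filter)


def fir(samples, coeffs, taps):
    """FIR filter that uses only the `taps` centre coefficients of `coeffs`."""
    start = (len(coeffs) - taps) // 2
    active = coeffs[start:start + taps]
    out = []
    for n in range(len(samples)):
        acc = 0.0
        for k, c in enumerate(active):
            if n - k >= 0:
                acc += c * samples[n - k]
        out.append(acc)
    return out
```

Truncating around the centre tap keeps the shortened filter symmetric, which is one plausible way to shorten a linear-phase filter without redesigning the coefficients.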

In Synthesis and Place&Route: Designing with multiple clock domains, clock dividers.

[1] R. Veljanovski, “A Reconfigurable Root Raised Cosine Filter for a mobile receiver,” Ph.D. dissertation, Victoria University of Technology, 2003.

[2] H. Bruce, “Power optimisation of a reconfigurable FIR filter,” Master’s thesis, Lund University, Sweden, 2004.

[3] D. Dasalukunte, A. Palsson, M. Kamuf, P. Persson, R. Veljanovski, V. Öwall, “Architectural Optimization for Low power in a Reconfigurable UMTS filter,” WPMC, San Diego, 2006.

Implementation of a JPEG Accelerator

Johan Lofgren

email: [email protected]

Abstract—The aim of this project is to implement a JPEG accelerator

in hardware. The implementation should be able to both encode and

decode JPEG images. The encoder chain consists of a Discrete Cosine Transform (DCT) unit, a Quantizer unit and an Entropy Encoder unit. The decoder chain needs an Entropy Decoder, a Dequantizer and an Inverse Discrete Cosine Transform (IDCT) unit. The DCT unit should be based on the algorithm proposed by Arai et al. The implementation process goes through the basic steps of an ASIC design flow, from specification, software modelling and implementation to synthesis and place and route.

I. INTRODUCTION

The JPEG standard is one of the most common image compression

standards available. It is a lossy compression, which means that the

encoded image can normally not be exactly recreated. This allows

for higher compression rates. The encoding/decoding procedure is

suitable for hardware acceleration, because the process consists of

many similar, simple operations. In this project acceleration hardware

is constructed.

A. JPEG Encoding

[Figure: Image (8x8 block) → DCT → Quantizer (ZigZag) → Entropy Encoder → Compressed Data, with two Table inputs.]

Fig. 1. JPEG encoding chain

The encoding chain is shown in Fig. 1. First the image data is transformed with a Discrete Cosine Transform (DCT). This transform has properties that allow better compression of the image. The next stage quantizes the output of the DCT. This is the lossy stage, where information is lost. However, the removed information is that to which the human eye is least sensitive, so the image still looks good to a human observer. Finally the image data is encoded using entropy (Huffman) coding and stored in compressed form.
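Why quantization is the lossy step can be seen in a short Python sketch: each DCT coefficient is divided by a table entry and rounded, and dequantization can only multiply the rounded level back. The table values in the usage note are made up for illustration and are not taken from the JPEG standard; the function names are assumptions as well.

```python
def quantize(coeffs, table):
    """Divide every coefficient by its table entry and round to an integer."""
    return [[int(round(c / q)) for c, q in zip(c_row, q_row)]
            for c_row, q_row in zip(coeffs, table)]


def dequantize(levels, table):
    """Multiply every quantized level by its table entry."""
    return [[lv * q for lv, q in zip(l_row, q_row)]
            for l_row, q_row in zip(levels, table)]
```

For example, with coefficients `[[100, 7], [-33, 3]]` and the illustrative table `[[16, 11], [10, 16]]`, a quantize and dequantize round trip gives `[[96, 11], [-30, 0]]`: small coefficients vanish and the rest are only approximated.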

B. JPEG Decoding

[Figure: Compressed Data → Entropy Decoder → Dequantizer (ZigZag) → IDCT → Image (8x8 block), with two Table inputs.]

Fig. 2. JPEG decoding chain

Decoding runs the process in reverse, as shown in Fig. 2. First the entropy decoding block is executed, then the dequantizer, and finally the Inverse Discrete Cosine Transform (IDCT). Different table data is used depending on the level of compression.

C. Discrete Cosine Transform

The DCT is similar to the Discrete Fourier Transform (DFT) and

is defined by the following equation

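$$F(u,v) = \sum_{x=0}^{N-1} \sum_{y=0}^{M-1} f(x,y) \cos\left[\frac{(2x+1)u\pi}{2N}\right] \cos\left[\frac{(2y+1)v\pi}{2M}\right]$$

where f(x, y) is the pixel value at position (x, y) of the N × M block; scale factors are omitted.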

The 2-D DCT has better energy concentration properties than the DFT, see Fig. 3. It produces fewer high-frequency components at the boundaries of the transformation blocks and thus fewer visual artifacts. Therefore, the DCT is used in many standardized compression algorithms, e.g., JPEG and MPEG. As for the DFT, there exist many efficient implementation algorithms. A good compilation of these is given by G.S. Taylor and G.M. Blair [1]; the most important of them is the algorithm proposed by Arai et al. [2], on which this assignment is based.
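As a reference model for test vector generation, the defining double sum can be evaluated directly. The Python sketch below assumes that each cosine product is weighted by the pixel value f(x, y) and, like the definition, leaves out the scale factors that JPEG applies. It is a slow direct evaluation, not the fast algorithm of Arai et al., and the name `dct2` is hypothetical.

```python
import math


def dct2(block):
    """Unnormalized 2-D DCT of an N x M block by direct summation."""
    n, m = len(block), len(block[0])
    out = [[0.0] * m for _ in range(n)]
    for u in range(n):
        for v in range(m):
            acc = 0.0
            for x in range(n):
                for y in range(m):
                    acc += (block[x][y]
                            * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                            * math.cos((2 * y + 1) * v * math.pi / (2 * m)))
            out[u][v] = acc
    return out
```

For a constant 8 × 8 block all the energy ends up in F(0, 0), which is a simple first check for a hardware implementation.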

Fig. 3. Example of energy compaction of an 8 × 8 block using the DCT algorithm.

II. ASSIGNMENT

The assignment is divided into a number of parts.

Grade 3: In order to pass the course (grade 3), the minimal

requirements are to implement the 2D DCT and the 2D IDCT,

together with the quantizer/dequantizer units.

Grade 4: To receive a grade 4, also the entropy encoder/decoder

needs to be implemented.

Grade 5: A grade 5 requires a full implementation including a

way to display the image, either using a VGA output or an interface

to a computer. Gray-scale images are sufficient.

All Grades: Implement a model in, e.g., C or MATLAB and generate test vectors. Translate the software model into VHDL and verify the functionality by simulation in ModelSim. Synthesize the design and perform P&R.

REFERENCES

[1] G. Taylor and G. Blair, “Design for the discrete cosine transform in VLSI,” in Computers and Digital Techniques, IEE Proceedings, vol. 145, no. 2, Mar. 1998.

[2] Y. Arai, T. Agui, and M. Nakajima, “A fast DCT-SQ scheme for images,” in Trans. IEICE, vol. E71, Nov. 1988, pp. 1095–1097.

IC-Project:

Hardware Based Media player

Objective:

In this project the students will design a hardware based media player application. This

application must have the capability of handling analog stereo sound input and output the

sound on the speakers. The student must display the different frequency bands on the

VGA monitor, with a keyboard based control panel. The panel should be able to control

the volume, balance and frequency attenuations or amplifications.

Grading:

Grade: 3

Mandatory: Handle analog stereo sound input and output the sound on the speakers. The student must display 10 different frequency bands on the VGA monitor for each channel. You will have to submit a preliminary report explaining how you have defined your filter bands, together with MATLAB and VHDL implementations.

Grade: 4

Mandatory: Tasks for grade 3 + a keyboard based control panel that should be able to control the volume, balance and frequency attenuations or amplifications. You will have to submit a preliminary report explaining how you have reconstructed your audio signal after filtering, together with MATLAB and VHDL implementations.

Grade: 5

Mandatory: Tasks for grade 4 + an echo feature with a control system to change the echo duration between at least two different time periods. You will have to submit a report explaining how you have implemented the echo, together with MATLAB and VHDL implementations.

All tasks must go through the complete flow of Digital IC Design and a final report must be submitted.

Supervisors:

Yasser

IC-Project and Verification - ETIN01

Surveillance System

Project Description

Isael Diaz

EIT - Lund University

2010 / 2011 Ht2-Vt2

1 Introduction

Automated surveillance systems are being introduced in modern society more and more often. One of the factors influencing this trend is that today it is possible to include an image sensor in nearly any tiny gadget, the mobile phone being the perfect example. Not only are cameras becoming smaller, and therefore less intrusive, but people are also getting more used to being surrounded by cameras.

One additional factor is the high demand for situation awareness. Safety has become one of the main concerns of the average citizen, and companies and governments spend more money than ever on making sure that a certain area is secure or that any threat or danger is detected in good time. Surveillance systems can be designed to cover one or several tasks such as identity tracking, location tracking, activity tracking, etc. All these tasks require awareness of the system’s scene. Typical questions to be answered by such systems are: Is there someone in the scene? Who is that someone in the scene? Is that someone allowed to be in the scene? What is that someone doing?

2 Functional Description

Since the approach selected for this project is based on background subtraction, the first stage consists of a background subtraction in which the pixels not considered part of the background are extracted. Robust background subtraction is a research topic of its own. A binary foreground pixel is obtained by subtracting a background model from the current image and thresholding the absolute difference, denoted by

FG = |I −BG| > Thr (1)

where FG denotes the binary foreground, I the current pixel value in the scene, BG the known background model stored in the system, and Thr a constant threshold defined by the illumination and other conditions in the scene.
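Equation 1 maps directly to a per-pixel comparison. A minimal Python sketch, assuming gray-scale images stored as lists of rows (the function name `foreground_mask` is hypothetical):

```python
def foreground_mask(image, background, thr):
    """Binary foreground: 1 where |I - BG| > Thr, else 0."""
    return [[1 if abs(i - b) > thr else 0 for i, b in zip(i_row, b_row)]
            for i_row, b_row in zip(image, background)]
```

A software model like this can be used to produce the expected foreground images for the hardware test bench.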

Once the foreground is detected, it is necessary to group pixels into blobs, where every blob represents an independent object moving in the scene. This is accomplished by connecting and classifying neighboring foreground pixels. This operation is called labeling and consists of tagging all adjacently connected pixels with a common label. All pixels with the same label are defined to belong to the same blob. A pixel’s label can be calculated by Equation 2, where L denotes the label at position (i, j), P() is the propagation function that generates a new label or propagates the neighboring label, and K denotes the neighborhood dimensions.

[Figure: Moving object → BG Subtraction → Object Segmentation → Feature Extraction → Object tracking]

Figure 1: Object tracking stages

$$L(i,j) = I(i,j) \cdot P\left(\sum_{n=-K}^{K} L(i+n,\, j-1)\right) \qquad (2)$$

A blob describes an object or group of objects. When pixels are grouped into blobs, features like area and center of gravity (COG) can be measured from the maximum and minimum coordinate values of the blobs in the image. When stereo images are provided, it is even possible to estimate the depth from the sensors to the object of interest. Objects identified in the foreground can be seen as objects with a specific position, occupying a certain area, and moving with a certain speed in a specific direction in the scene. All these parameters can be tracked in order to fire an alarm when an object moves too close to a certain corner of the scene, which could be an entrance to a restricted area.
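For a software reference model, labeling and feature extraction can be sketched in Python as follows. Note that this sketch uses a flood fill with 8-connectivity rather than the row-by-row label propagation of Equation 2, which is better suited to streaming hardware; the resulting blobs should agree for simple scenes. The names `label_blobs` and `features` are assumptions, and the center is derived from the bounding box, i.e. from the minimum and maximum coordinates.

```python
from collections import deque


def label_blobs(fg):
    """Label 8-connected foreground pixels; return (labels, bounding boxes)."""
    rows, cols = len(fg), len(fg[0])
    labels = [[0] * cols for _ in range(rows)]
    boxes = {}
    next_label = 1
    for r in range(rows):
        for c in range(cols):
            if fg[r][c] and not labels[r][c]:
                labels[r][c] = next_label
                queue = deque([(r, c)])
                box = [r, c, r, c]  # min_row, min_col, max_row, max_col
                while queue:
                    y, x = queue.popleft()
                    box = [min(box[0], y), min(box[1], x),
                           max(box[2], y), max(box[3], x)]
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and fg[ny][nx] and not labels[ny][nx]):
                                labels[ny][nx] = next_label
                                queue.append((ny, nx))
                boxes[next_label] = tuple(box)
                next_label += 1
    return labels, boxes


def features(box):
    """Size and center of a blob, computed from its bounding box."""
    min_r, min_c, max_r, max_c = box
    return {"height": max_r - min_r + 1,
            "width": max_c - min_c + 1,
            "center": ((min_r + max_r) / 2, (min_c + max_c) / 2)}
```

The bounding boxes returned here are also what would be drawn around each object on the VGA output.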

Figure 1 shows the different stages from background subtraction to object tracking. Some examples of final implementations can be found in the articles listed in the reference section at the end of this document.

3 Project

The project consists of developing a number of hardware accelerators contained in a typical surveillance system. The number of accelerators to be developed varies according to the desired grade in the project.

Figure 2 shows a block diagram of the final surveillance system. The accelerators to be developed are shown in pink, while the remaining blocks have to be put in place in order to have full functionality. They can be downloaded from the Internet (typically the Xilinx website) or obtained from anywhere else.

3.1 Grade 3 - Object Segmentation and Feature Extraction

In order to obtain a grade 3, two functions have to be developed, namely Object Segmentation and Feature Extraction. The input to this stage is an SVGA size image (800x600) of the foreground stored in the external memory.

The developed block has to be able to read the foreground image from the external memory, group pixels into blocks, or segments, and extract object features such as size and position. Note that it is crucial not to interfere with the writing of the foreground from the previous stages. The input image has to be shown on a VGA monitor, which is to be connected to the test environment. The objects in the foreground have to be enclosed in a green box. The system has to be able to detect objects with a minimum size of 25x20 pixels.

[Figure: test platform with Memory Controller, External Memory (Background Model, Foreground), Background Subtraction, Filters stage, Object Segmentation, Feature Extraction, Object Tracking and VGA Controller]

Figure 2: Surveillance system

3.2 Grade 4 - Object tracking

In order to obtain a grade 4, the position and size of the previously labeled objects are registered and their speed and direction extracted. The green box enclosing an object in the foreground changes color to red for objects that are moving faster than n pixels per second. By default n is equal to 30. The system has to be able to follow up to 10 independent objects in the image.

Instead of a single foreground image, the test environment will store a new foreground image into the external memory at regular intervals, e.g. 18 frames per second.
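The speed test can be sketched in Python under the assumption that an object is matched between two consecutive frames and that its position is given as an (x, y) center in pixels. The frame rate of 18 frames per second and the limit of 30 pixels per second are the defaults from the text; the function name `box_state` is hypothetical.

```python
import math


def box_state(prev_center, curr_center, fps=18, limit=30):
    """Return (speed in pixels/s, direction in degrees, box color)."""
    dx = curr_center[0] - prev_center[0]
    dy = curr_center[1] - prev_center[1]
    speed = math.hypot(dx, dy) * fps        # pixels per frame times frames per second
    direction = math.degrees(math.atan2(dy, dx))
    color = "red" if speed > limit else "green"
    return speed, direction, color
```

At 18 frames per second a displacement of 2 pixels per frame already exceeds the default limit (36 pixels per second), while 1 pixel per frame does not.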

3.3 Grade 5 - Complete Surveillance system

Grade 5 is obtained by completing the functionality of the entire system: a real camera is incorporated and background subtraction is performed to create the foreground image used by the subsequent stages described above. Since the images will be streamed in from real life, some filtering is advised to get rid of small undesired particles in the image.

The background model will be created from an initial image taken by the camera and stored in the external memory. This initial image has to be clean, that is to say, contain no moving objects.

3.4 Specifications

• All accelerators have to be placed together in the same silicon core. Try to reuse as many pads as possible.

• For functionality purposes, a Xilinx Virtex II Pro development kit will be used with an NI-LM9648 camera running at 18 frames per second (in case of grade 5).

• All blocks or functions must be able to show their functionality on the FPGA as standalone blocks.

• The minimum size of objects to be detected is 25x20 pixels (grade 3).

• Calculate and indicate speed for a maximum of 10 objects that move faster than 30 pixels per second (grade 4).

3.5 Grading

• For grade 5, the entire surveillance system has to be demonstrated together, not as separate blocks.

• The final grade will be assigned by analyzing the project’s development, realization, verification and corresponding written reports.

ETIN01 - Digital IC-project & Verification (HT2∼VT2 2010/2011)

Mini-MIPS design project v.1.0

Short project description

This document describes the Mini-MIPS design project which is part of course ETIN01

“Digital IC-project & Verification” conducted at EIT, LTH. Mini-MIPS is a 32-bit RISC

with a simple instruction set similar to the MIPS instruction subset considered in chapter

4 of the textbook: D.A. Patterson & J.L. Hennessy, “Computer Organization and Design -

The Hardware/Software Interface, fourth edition.” (Chapters 5 and 6 in the third edition)

The idea of this project based course is to guide students through a simple microprocessor

design, in order to get a thorough understanding of the basic concepts taught throughout

the preceding course EITF20 “Computer Architecture”. Hence, functional richness of the

CPU design is not the primary goal in this course.

With the aid of SPIM assembly language simulator, ModelSim SE simulator, ASIC

design tools, and FPGA development board, students in the course will:

• Design an executable specification of the Mini-MIPS (Task 1);

• Design a simple 5-stage pipelined implementation of the Mini-MIPS (Task 2);

• Synthesize and Place & Route the pipelined Mini-MIPS in standard ASIC design

flow using 130nm low power CMOS cells (Task 3);

• Verify the pipelined Mini-MIPS in FPGA development board (Task 4);

• Integrate the pipelined Mini-MIPS with a console I/O peripheral (Task 5);

• Add a memory hierarchy to the pipelined Mini-MIPS (Task 6 & 7);

• Design and implement an extended instruction set to use MipsIt GCC C compiler

(Task 8).

Prerequisite courses to the Mini-MIPS design project are EITF35 “Introduction to Structured VLSI Design” and EITF20 “Computer Architecture”, in which the VHDL language, the ModelSim simulator and background knowledge on computer architecture have been introduced. The standard ASIC design flow, including hardware integration, synthesis, and place & route, has been introduced in earlier laboratory exercises of this course.

This project is composed of two parts and is correspondingly evaluated with two different grades. Tasks 1, 2, 3 and 4 are mandatory, and students will get grade 4 after completing them. To get a higher grade (grade 5), students should complete one of three optional assignments: Task 5, Task 6 & 7, or Task 8.

Contents

1 Introduction

2 Mini-MIPS specification

2.1 Instruction set

2.2 System structure

2.3 Bus protocol

3 Using SPIM for the Mini-MIPS project

3.1 Introduction

3.2 Memory map and initialization

3.3 Loading program data into the VHDL model

3.4 Virtual and bare machine

4 Compulsory tasks

4.1 Task 1: Single cycle behavioural specification

4.2 Task 2: Five stage pipeline implementation

4.3 Task 3: ASIC synthesis and place & route

4.4 Task 4: FPGA verification

5 Overview of design files

6 Optional tasks

6.1 Task 5: Adding a console I/O peripheral

6.2 Task 6: Adding a memory interface

6.3 Task 7: Adding an instruction- and/or data cache

6.4 Task 8: Using the MipsIt GCC C Compiler

1 Introduction

Mini-MIPS is a 32-bit RISC with a simple instruction set similar to the MIPS instruction

subset considered in chapter 4 in the textbook: D.A. Patterson & J.L. Hennessy,

“Computer Organization and Design - The Hardware/Software Interface, fourth edition”

Morgan Kaufmann, 2008 (Chapters 5 and 6 in the third edition). The Mini-MIPS 1.0

implements a true subset of the MIPS I instruction set architecture used in the MIPS

R2000. Like the real MIPS, the Mini-MIPS CPU operates with separate instruction and

data memories and has a 5-stage pipeline with register forwarding, delayed branches and

delayed load.

You are given a VHDL description of a Mini-MIPS system consisting of a Mini-MIPS

CPU template (which you will have to complete), a clock generator, and two instances

of a memory entity: the instruction memory and the data memory. Your tasks will be to

specify (task 1), design and implement in VHDL (task 2), synthesize (task 3), and verify

(task 4) the Mini-MIPS CPU, simulate the system and run small test programs to verify

the functionalities. You may also choose to do any of the optional tasks (task 5, 6, 7, 8) in

order to get a higher grade.

We use the PC-version of SPIM to develop assembly language programs and to translate

these into binary machine code. The VHDL model of the memory entity is capable

of extracting the binary image of a program and its data from a log-file saved from SPIM

and load it into the memory entity. In this way you can execute programs developed using

SPIM on your VHDL model of the Mini-MIPS system.

2 Mini-MIPS specification

2.1 Instruction set

Mini-MIPS is a 32-bit RISC processor with 32-bit instructions and 32-bit data. The instruction and data memory address space is 2^32 bytes, and only word-aligned addressing is supported in the Mini-MIPS instruction subset. Mini-MIPS has 32 general purpose 32-bit registers R[0:31]. Two registers are special: R[0] contains the constant 0 (hardwired) and writes to it are ignored; R[31] is used to hold return addresses in case of procedure calls (the jal-instruction). The Mini-MIPS instruction formats are defined in Fig. 1 and the instruction set is defined in Table 1.

2.2 System structure

The Mini-MIPS CPU uses separate instruction and data memories. This basic configuration is shown in Fig. 2(a). The separate instruction and data memory interfaces of the Mini-MIPS CPU provide significant freedom in designing the memory system. Fig. 2(b) shows a simple low-cost system with only one single main memory module, and Fig. 2(c) shows a high performance system with separate instruction and data caches and a single main memory module. Notice also that the main memory bus used in Fig. 2(b) and (c) can be different from the one used by the Mini-MIPS CPU.

R-type: Opcode [31:26] | Rs [25:21] | Rt [20:16] | Rd [15:11] | Shamt [10:6] | Function [5:0]

I-type: Opcode [31:26] | Rs [25:21] | Rt [20:16] | Immediate/Offset [15:0]

J-type: Opcode [31:26] | Target [25:0]

Figure 1: Mini-MIPS instruction formats.
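The field boundaries of the three formats translate into shifts and masks. The Python sketch below extracts every field from a 32-bit instruction word; which fields are meaningful depends on the format selected by the opcode. The function name `decode` and the dictionary keys are illustrative assumptions.

```python
def decode(word):
    """Split a 32-bit instruction word into all R-, I- and J-type fields."""
    return {
        "opcode": (word >> 26) & 0x3F,   # bits 31-26
        "rs":     (word >> 21) & 0x1F,   # bits 25-21
        "rt":     (word >> 16) & 0x1F,   # bits 20-16
        "rd":     (word >> 11) & 0x1F,   # bits 15-11 (R-type)
        "shamt":  (word >> 6) & 0x1F,    # bits 10-6  (R-type)
        "funct":  word & 0x3F,           # bits 5-0   (R-type)
        "imm":    word & 0xFFFF,         # bits 15-0  (I-type)
        "target": word & 0x3FFFFFF,      # bits 25-0  (J-type)
    }
```

For instance, `0x00221821` decodes to opcode 0, rs 1, rt 2, rd 3 and function 0x21, i.e. an addu of R[1] and R[2] into R[3].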

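[Figure: (a) Mini-MIPS connected to separate I-Mem and D-Mem; (b) Mini-MIPS connected through a Bus Interface to a single Mem; (c) Mini-MIPS connected through an I-Cache and a D-Cache and a Bus Interface to a single Mem.]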

Figure 2: Different Mini-MIPS system configurations.

2.3 Bus protocol

Both data and instruction memory interfaces of the Mini-MIPS use the same protocol.

This implies that I-Mem and D-Mem in Fig. 2(a) can be modelled as instances of the

same VHDL entity, with I-Mem initialized with instructions and D-Mem with data. Fig. 3

shows signals in the memory interface and Fig. 4 shows the protocol. The interfaces are

synchronous and only two transactions are provided: Write Word and Read Word. Also,

the CPU indicates on a clock cycle basis if it is using the bus or not. Signals REQ and

RW are used for this indication (and they are valid just after the rising edge of the clock

and throughout the clock period).

As there are multiple drivers on the DATA bus, the different data sources need to be properly connected to avoid conflicts. Usually, three kinds of bus connection are available: OR-, MUX-, and Tristate-bus. In this project, the OR-bus is chosen because of its small hardware footprint and its availability (tristate buffers are not always available, for example in FPGAs).
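The idea behind an OR-bus is that every source that is not selected drives all zeros, so the bus value is simply the bitwise OR of all sources. A minimal Python model, assuming each source is described by a (selected, value) pair (the name `or_bus` is hypothetical):

```python
def or_bus(sources):
    """Bus value of an OR-bus; sources is a list of (selected, value) pairs."""
    bus = 0
    for selected, value in sources:
        bus |= value if selected else 0   # unselected sources drive zeros
    return bus
```

The scheme only gives a meaningful result when at most one source is selected at a time, which the control logic has to guarantee.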

The HOLD signal in the memory interface is raised when the memory needs more than one

clock cycle to perform the read or write transaction. In case of a write operation, the

Mini-MIPS should continuously drive the bus as shown in Fig. 4.


Table 1: Mini-MIPS instruction set (a true subset of the MIPS R2000 instruction set).

α = 0 for the single cycle Task 1 version, α = 4 for the pipelined Task 2 version. ‘&’

indicates bit-string concatenation; ‘s()’ represents data sign extension.

Inst. 31-26 25-21 20-16 15-11 10-6 5-0 Semantics

Arithmetic

addu X“00” R[s] R[t] R[d] X“00” X“21” R[d] = R[s] + R[t]

addiu X“09” R[s] R[t] Imm R[t] = R[s] + s(Imm)

subu X“00” R[s] R[t] R[d] X“00” X“23” R[d] = R[s] - R[t]

multu X“00” R[s] R[t] X“00” X“00” X“19” LO = ((R[s] * R[t]) ≪ 32) ≫ 32; HI = (R[s] * R[t]) ≫ 32

Logical

and X“00” R[s] R[t] R[d] X“00” X“24” R[d] = R[s] AND R[t]

or X“00” R[s] R[t] R[d] X“00” X“25” R[d] = R[s] OR R[t]

xor X“00” R[s] R[t] R[d] X“00” X“26” R[d] = R[s] XOR R[t]

sll X“00” X“00” R[t] R[d] Shamt X“00” R[d] = R[t] ≪ Shamt (logical)

srl X“00” X“00” R[t] R[d] Shamt X“02” R[d] = R[t] ≫ Shamt (logical)

sra X“00” X“00” R[t] R[d] Shamt X“03” R[d] = R[t] ≫ Shamt (arithmetic)

slt X“00” R[s] R[t] R[d] X“00” X“2A” R[d] = if (R[s] < R[t]) then 1D else 0D (signed)

sltu X“00” R[s] R[t] R[d] X“00” X“2B” R[d] = if (R[s] < R[t]) then 1D else 0D (unsigned)

Data Transfer

mfhi X“00” X“00” X“00” R[d] X“00” X“10” R[d] = HI

mflo X“00” X“00” X“00” R[d] X“00” X“12” R[d] = LO

lui X“0F” X“00” R[t] Imm R[t] = Imm & X“0000”

lw X“23” X“00” R[t] Offset R[t] = Mem[R[s] + s(Offset)]

sw X“2B” X“00” R[t] Offset Mem[R[s] + s(Offset)] = R[t]

Unconditional jump

j X“02” Target PC = (PC+ α)[31:28] & Target[25:0] & “00”

jal X“03” Target R[31] = PC+ 4+ α; PC = (PC+ α)[31:28] & Target[25:0] & “00”

jr X“00” R[s] X“00” X“00” X“00” X“08” PC = R[s]

Conditional branch

beq X“04” R[s] R[t] Offset PC = if (R[s] == R[t]) then (PC+ α+(s(Offset) ≪ 2)) else (PC+ 4)

bne X“05” R[s] R[t] Offset PC = if (R[s] ≠ R[t]) then (PC+ α+(s(Offset) ≪ 2)) else (PC+ 4)
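A few of the less obvious semantics in Table 1 can be checked against a small Python model that works on unsigned 32-bit values. This is only a sketch of selected rows; the helper names are hypothetical, and `alpha` is passed in explicitly (0 for Task 1, 4 for Task 2).

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """s(): sign-extend a 16-bit field to a Python integer."""
    return imm - 0x10000 if imm & 0x8000 else imm


def to_signed32(value):
    """Interpret an unsigned 32-bit value as two's complement."""
    return value - (1 << 32) if value & 0x80000000 else value


def multu(rs, rt):
    """Return (HI, LO) of the unsigned 64-bit product."""
    product = (rs & MASK32) * (rt & MASK32)
    return (product >> 32) & MASK32, product & MASK32


def sra(rt, shamt):
    """Arithmetic right shift: the sign bit is replicated."""
    return (to_signed32(rt) >> shamt) & MASK32


def slt(rs, rt):
    return 1 if to_signed32(rs) < to_signed32(rt) else 0


def sltu(rs, rt):
    return 1 if (rs & MASK32) < (rt & MASK32) else 0


def branch_target(pc, offset, alpha):
    """Target of a taken beq/bne."""
    return (pc + alpha + (sign_extend16(offset) << 2)) & MASK32


def jump_target(pc, target, alpha):
    """Target of j/jal: upper 4 bits of PC + alpha, then Target, then 00."""
    return ((pc + alpha) & 0xF0000000) | ((target & 0x3FFFFFF) << 2)
```

Comparing such a model with the output of SPIM for the same inputs is one way to validate the executable specification before writing the VHDL.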

[Figure: Mini-MIPS and Memory connected by the signals REQ, RW, Hold, Addr and Data, both clocked by CLK.]

REQ RW Description

0 X Bus not used

1 1 Read transaction (load word)

1 0 Write transaction (store word)

Figure 3: Mini-MIPS memory interface (used for both I-Mem and D-Mem).

[Timing diagram: CLK, REQ, RW, Hold, Addr (Addr1 to Addr5) and Data (reads R1, R2, R3 and writes W4, W5).]

Figure 4: Timing diagram showing read and write transactions.

3 Using SPIM for the Mini-MIPS project

3.1 Introduction

We use SPIM to develop programs for the Mini-MIPS project, i.e. to simulate and debug

the programs, and to translate the symbolic assembly language code into binary machine

code, which is what the real hardware as well as the VHDL models of the Mini-MIPS can

execute. PC-SPIM is introduced in appendix A of the Patterson & Hennessy textbook as

well as in sections “Software” and “Tutorials” on the companion CD.

3.2 Memory map and initialization

The Mini-MIPS implementation conforms to the memory usage conventions of SPIM

and the real MIPS (P&H "COD" 3e, figures 2.17 and A.5.1): A user program must start

at address 0x00400000 (the text-segment). Similarly the data-segment starts at address

0x10000000, and by default SPIM will place data at address 0x10010000 unless the .sdata

or .rdata directive is followed by an address.

The VHDL code modelling of the memory unit contains data structures which popu-

late the following 3 fragments of the address space:

• 0x00000000 − 0x00000017 (Initialization and jump to 0x00400000)

• 0x00400000 − 0x00401ffc (Text segment, program starts at 0x00400000)

• 0x10010000 − 0x10011ffc (Data segment, only dynamic data and stack)
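When debugging, it can help to check whether an address issued by the CPU falls inside one of the populated fragments. The Python sketch below uses the three address ranges listed above; the fragment names and the function `fragment_of` are assumptions made for illustration.

```python
FRAGMENTS = [
    (0x00000000, 0x00000017, "init"),   # initialization and jump to 0x00400000
    (0x00400000, 0x00401FFC, "text"),   # text segment
    (0x10010000, 0x10011FFC, "data"),   # data segment, dynamic data and stack
]


def fragment_of(addr):
    """Name of the populated fragment holding a word address, or None."""
    if addr % 4 != 0:
        raise ValueError("only word-aligned addressing is supported")
    for low, high, name in FRAGMENTS:
        if low <= addr <= high:
            return name
    return None
```

An access for which `fragment_of` returns `None` points outside the modelled memory and usually indicates a wrong pointer or an uninitialized register.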


At reset the program counter is set to 0x00000000, and the initialization code in the

VHDL model sets up the stack pointer ($sp) & global pointer ($gp) and jumps to address

0x00400000 where your program starts. In the real MIPS as well as in SPIM, the top of the

stack will start at address 0x7fffeffc, and the $sp (R[29]) will be initialized accordingly.

Similarly, $gp (R[28]) is initialized to point to 0x10008000; but this is only used by a

compiler and you may forget all about it for now. As the VHDL model of the memory

populates only a small fraction of the address space, $sp is initialized to 0x10011ffc in

the VHDL model. If your test program does not depend on these initializations, you can

skip the initialization code and simply set the program counter to 0x00400000 and start

your program.

In a similar way SPIM may load an exception program (along with your test program)

which performs similar initializations and jumps to 0x00400000 starting your program.

The exception handler does much more than initializing the above registers, and it is

quite complex. Therefore it is recommended that you do not load it, i.e., deselect “Load

exceptions file” in the settings menu, and remember to manually set the program counter

to 0x00400000 before every simulation run.

3.3 Loading program data into the VHDL model

The VHDL model of the memory entity is capable of: (i) extracting the binary image of a

program and its data from a log-file saved from SPIM, and (ii) loading it into the memory.

In this way you can execute programs developed using SPIM on your VHDL model of

the Mini-MIPS system.

In practice this is handled as follows. Just after opening the assembler source file in

SPIM you save a log file (File-menu, “Save Log File”). Then you copy the log file to your

ModelSim project directory (where the .mpf and .vhd files are), and edit the filename

in the configuration (in test1.vhd or test2.vhd) to conform with the name of the log file.

Make sure that the first word of data at 0x10010000 is non-zero, or the .log file loader

will fail.

As explained later we provide source code for two test programs, and for these we

also provide log files generated with the “Load exceptions file” setting disabled.

3.4 Virtual and bare machine

SPIM supports both the virtual machine and the bare machine instruction set. SPIM is

capable of expanding pseudo-instructions but it is not able to reorder instructions.

When simulating the virtual machine described in Appendix A of the textbook, enable

the settings: “Allow pseudo-instructions”, “Mapped I/O”; and disable “Bare machine”,

“Delayed branches” and “Delayed loads”. This is how you develop test programs for the

single-cycle Mini-MIPS in Task 1.

When simulating the pipelined bare machine, i.e. test programs for the pipelined

Mini-MIPS in Task 2, “Bare machine”, “Delayed branches” and “Delayed loads” must

be checked. Note that in this mode, the assembly programs must be written to account

for the branch delay slot and the load-use data hazards, as SPIM is not able to reorder

(reschedule) instructions by itself. Beware of placing pseudo-instructions in the branch

delay slot, since they may be expanded into multiple instructions. The “quick and dirty”

way to do this is to manually insert a nop instruction after every load, branch & jump

instruction in the assembly source. Optimally the program instructions should be reordered

to place useful instructions in the branch delay slots and after load instructions. Verify

that the final program runs OK with “Delayed branches” and “Delayed loads” enabled in

SPIM, before running the program on your pipelined Mini-MIPS.
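The “quick and dirty” rule above can be sketched as a small classifier. This is only an illustration in C, not part of the provided design files: the function name needs_nop_after is hypothetical, and the mnemonic list is taken from the extended instruction set in Tables 2 and 3. Pseudo-instructions are not covered, since they may expand into several instructions.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if the mnemonic names a load, branch or
   jump, i.e. an instruction after which the "quick and dirty" method
   inserts a nop, and 0 otherwise. */
int needs_nop_after(const char *mnemonic)
{
    static const char *const delayed[] = {
        "lw",                                          /* delayed load */
        "beq", "bne", "bltz", "bgez", "blez", "bgtz",  /* conditional branches */
        "j", "jal", "jr", "jalr"                       /* unconditional jumps */
    };
    size_t i;

    for (i = 0; i < sizeof delayed / sizeof delayed[0]; i++) {
        if (strcmp(mnemonic, delayed[i]) == 0)
            return 1;
    }
    return 0;
}
```

A nop after every such instruction is always safe but wastes a cycle each time; reordering useful instructions into those positions removes that cost.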

4 Compulsory tasks

4.1 Task 1: Single cycle behavioural specification

Your first task will be to write a top-level behavioural specification of the Mini-MIPS CPU

(as abstract, short and precise as possible). You will be given a test bench corresponding

to the system structure shown in Fig. 2(a). The CPU architecture body includes the data-

and instruction memories in the form of two arrays of 32-bit words. The “memories” are

initialized by reading a test program from a text file. The filename is a generic and it

is specified in the configuration for the test bench. This makes it straightforward to run

different programs on the VHDL model of the Mini-MIPS.

The only thing missing in this model is the VHDL code specifying the behaviour of

the CPU entity. The specification should be a single cycle model (i.e. without pipelining,

delayed branch, delayed load, and register forwarding). An overview of the files given to

you is found in Section 5.

We provide you with two test programs, one for computing Fibonacci numbers and

one for computing square-roots. These two programs will NOT test the CPU completely.

To test all instructions in the instruction set you will need to develop additional test

programs. Try to use each instruction in a few different ways so that all parts of the CPU get

tested.

At reset your Mini-MIPS should start fetching instructions from address 0x00000000

which contains a small start-up program. You can bypass this for debugging purposes and

start from 0x00400000 if you need to.

A) Write a behavioural specification of the Mini-MIPS CPU and run the two test

programs in the VHDL simulator. NB: Make sure to select VHDL-93 in the simulator.

B) Write a test program that tests all instructions in the Mini-MIPS instruction set.

Compare the behaviour of your Mini-MIPS with SPIM.
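Before writing the VHDL, it can help to have a tiny software reference of what "single cycle" means: fetch one word, decode its fields, update the registers and the PC. The following is a minimal sketch in C, covering only addu, subu and addiu with the encodings from Table 2. The type cpu_state and the function names are hypothetical and do not correspond to anything in the provided files.

```c
#include <stdint.h>

/* Field extraction for a 32-bit instruction word (bit ranges as in Table 2). */
uint32_t opcode(uint32_t w) { return w >> 26; }          /* bits 31-26 */
uint32_t rs(uint32_t w)     { return (w >> 21) & 0x1F; } /* bits 25-21 */
uint32_t rt(uint32_t w)     { return (w >> 16) & 0x1F; } /* bits 20-16 */
uint32_t rd(uint32_t w)     { return (w >> 11) & 0x1F; } /* bits 15-11 */
uint32_t funct(uint32_t w)  { return w & 0x3F; }         /* bits 5-0   */

/* s(Imm): sign-extend the 16-bit immediate to 32 bits. */
int32_t simm(uint32_t w)
{
    return (int32_t)((w & 0xFFFF) ^ 0x8000) - 0x8000;
}

/* Hypothetical architectural state: program counter and register file. */
typedef struct {
    uint32_t pc;
    uint32_t r[32];
} cpu_state;

/* Execute one instruction word. Returns 1 on success and 0 if the
   instruction is not one of the three handled by this sketch. */
int step(cpu_state *s, uint32_t w)
{
    uint32_t result, dest;

    if (opcode(w) == 0x00 && funct(w) == 0x21) {        /* addu */
        result = s->r[rs(w)] + s->r[rt(w)];
        dest = rd(w);
    } else if (opcode(w) == 0x00 && funct(w) == 0x23) { /* subu */
        result = s->r[rs(w)] - s->r[rt(w)];
        dest = rd(w);
    } else if (opcode(w) == 0x09) {                     /* addiu */
        result = s->r[rs(w)] + (uint32_t)simm(w);
        dest = rt(w);
    } else {
        return 0;
    }
    if (dest != 0)          /* register 0 always reads as zero */
        s->r[dest] = result;
    s->pc += 4;
    return 1;
}
```

For example, the word 0x00221821 encodes addu with R[s] = 1, R[t] = 2 and R[d] = 3, so step adds registers 1 and 2 into register 3 and advances the PC by 4.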

4.2 Task 2: Five stage pipeline implementation

Your next task is to make an implementation-like behavioural model of the Mini-MIPS

CPU with a 5-stage pipeline (IF, ID, EX, MEM, WB), and the necessary register forwarding

needed to resolve data hazards in the pipeline. Your CPU should work with external

memory modules operating according to the timing diagram in Fig. 4. Do not forget to

support the HOLD signal. You will be given a VHDL-model of a memory module, and

a test bench instantiating a system composed of a clock generator, your CPU, and two

memories. The memories are loaded from text files specified in the configuration.

Unlike the example in chapter 6 in the textbook, your Mini-MIPS implementation

must have a branch delay slot of only one instruction, similar to the MIPS I ISA’s “delayed

branches”. This means that calculation of branch conditions and updating of the

PC should be done in the ID-stage of the pipeline. This implies that register forwarding

must also be handled by the ID-stage, which again implies that the ID-stage always delivers

the proper register operands to the EX-stage. Consequently, the register forwarding

mechanism can be implemented by multiplexers just after the outputs of the register file.

Figure 5: Structuring your VHDL model into processes. Process P1 contains both combinational

and sequential logic; all outputs are registered. P2comb is pure combinational

logic. P2reg is a sequential process which models a register (using flip-flops).

Finally we adopt the “delayed load” MIPS I ISA programming convention that no

instruction will immediately access a register that is loaded by the preceding lw instruction.

With two memories, delayed branches, and delayed load, it is now possible to resolve all

remaining hazards with a forwarding unit.
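The forwarding multiplexers after the register file can be thought of as a priority selection: the youngest in-flight result for a register wins over older ones, and the register file is used only when nothing matches. The C sketch below illustrates this under stated assumptions. The type fwd_source, the function read_operand, and the exact set of forwarding sources (EX, MEM, WB) are hypothetical; your pipeline diagram decides which sources actually exist and when their values are valid.

```c
#include <stdint.h>

/* One in-flight result (hypothetical): which register a later stage will
   write, and the value it will write. */
typedef struct {
    int      valid;  /* 1 if this stage holds a result that will be written */
    unsigned dest;   /* destination register number, 0..31 */
    uint32_t value;  /* the result */
} fwd_source;

/* Operand selection in the ID-stage: pick the newest matching result,
   otherwise fall back to the register file. */
uint32_t read_operand(unsigned reg, const uint32_t regfile[32],
                      fwd_source ex, fwd_source mem, fwd_source wb)
{
    if (reg == 0)
        return 0;                                  /* register 0 is always zero */
    if (ex.valid && ex.dest == reg)
        return ex.value;                           /* newest result */
    if (mem.valid && mem.dest == reg)
        return mem.value;
    if (wb.valid && wb.dest == reg)
        return wb.value;                           /* oldest result */
    return regfile[reg];
}
```

In this sketch a lw that is still in the EX-stage would have valid = 0, since its data has not been read yet; the delayed load convention guarantees that the instruction in the ID-stage does not ask for it.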

A) Draw a block diagram of your 5-stage Mini-MIPS pipeline (similar to the figures

in chapter 6 of the textbook 3e, for example figure 6.36 or 6.41). If you make this

detailed, nice, and clear it will save you hours of VHDL code debugging.

B) Write a behavioural implementation-like VHDL description of the pipelined Mini-

MIPS. You should organize your description into a number of processes executing

concurrently within a single architecture body. You might want to describe larger

components (ALU, Register file) as separate entities and instantiate them as

components in the main architecture body of the CPU, as this makes testing easier.

Hints: Each pipeline stage in the Mini-MIPS CPU performs a number of independent

computations and you can model each of these as a separate process. Fig. 5

illustrates this idea. If the input signals to a register are needed elsewhere in the design,

for example to implement forwarding, these signals must be declared and the

pipeline stage fragment is modelled as a combinational and a sequential part. NB:

The VHDL coding style used in Fig. 5 is synthesizable. There is a branch delay

slot after unconditional jump and conditional branch instructions, i.e. bxx, bxxx, j,

jr, jal, jalr instructions.

Figure 6: An illustration of the test set-up for Mini-MIPS using Xilinx MicroBlaze. [Figure labels: PC; UART Lite; MicroBlaze; PLB bus 4.6; PLB bus driver & user logic; Xilinx XUP Virtex-II Pro development board; Mini-MIPS blocks PC, Branch, Register, ALU, IF/ID, ID/EXE, EXE/MEM, MEM/WB, PGM, DM; PA; PB; debugging data path.]

C) Develop additional test programs to test register forwarding from all stages, delayed

load and delayed branches. Make sure that all data paths in the CPU are tested.
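The hint under B) about splitting a pipeline stage fragment into a combinational and a sequential part (P2comb and P2reg in Fig. 5) can be illustrated with a two-function model of the PC register. This is a sketch in C rather than VHDL, and it rests on assumptions: the names if_stage_reg, if_comb and if_reg are hypothetical, and HOLD is assumed simply to freeze the register for that cycle.

```c
#include <stdint.h>

/* Hypothetical pipeline register holding only the program counter. */
typedef struct {
    uint32_t pc;
} if_stage_reg;

/* Combinational part (the role of P2comb): computes the next value from
   the current value and the inputs. It has no state of its own. */
if_stage_reg if_comb(if_stage_reg cur, int branch_taken, uint32_t branch_target)
{
    if_stage_reg next;
    next.pc = branch_taken ? branch_target : cur.pc + 4;
    return next;
}

/* Sequential part (the role of P2reg): called once per rising clock edge.
   Reset loads the start address; hold keeps the old value. */
void if_reg(if_stage_reg *cur, if_stage_reg next, int reset, int hold)
{
    if (reset)
        cur->pc = 0x00000000u;   /* instruction fetch starts here at reset */
    else if (!hold)
        *cur = next;
}
```

Because the next value exists as a separate signal before the clock edge, other logic such as a forwarding multiplexer can read it, which is the reason the hint asks for the split.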

4.3 Task 3: ASIC synthesis and place & route

Synthesize and place & route the Mini-MIPS in a standard ASIC design flow using 130 nm

low-power CMOS cells. Make sure the CPU is correctly implemented by doing a back-annotated

timing simulation. You should also determine the longest combinational path

and the maximum clock frequency of your Mini-MIPS CPU, as well as the hardware resources

used.

4.4 Task 4: FPGA verification

Synthesize and place & route the Mini-MIPS for FPGA implementation in Xilinx ISE,

and verify the functionality using the Xilinx XUP Virtex-II Pro development board.

Hints: In order to send data into and read data out from the Mini-MIPS, you should

design a proper verification environment around the CPU to ease the tests. One simple

approach is to use a master processor to drive the Mini-MIPS. You might for instance

build a testbed in Xilinx EDK, where you attach the Mini-MIPS as a slave processor and an

RS232 module as a communication interface to the common PLB data bus, and drive

the bus by the Xilinx MicroBlaze microprocessor. The RS232 interface can be used to

transfer data between the MicroBlaze and an external PC, while the MicroBlaze in turn transfers

data to and from the Mini-MIPS. The block diagram of this test set-up is illustrated in

Fig. 6.

5 Overview of design files

The following files are available. You can download them from the course homepage.

types.vhd: defines common types for the rest of the modules.

clock.vhd: defines the clock generator.

Task 1

cpu1.vhd: defines the CPU entity. It includes the necessary memory initializations.

Your task is to fill in VHDL code to model the behaviour of a single-cycle Mini-MIPS

CPU. The entity cpu1 has a generic parameter called programname. This parameter is

used to specify which program is executed by the CPU. The file containing the program

is obtained by loading an assembly file (for example fib1.s) into the SPIM simulator and

then saving the log file. The log file obtained has all the information necessary for the

CPU to run the program.

mem1.vhd: defines the memory package. It includes procedures for accessing the

instruction and data memory.

test1.vhd: defines the test bench. It instantiates two components: A clock generator

and a CPU (cpu1). Two different configurations are given. testfib is used to execute the

Fibonacci test program and testsqrt is used to execute the square root test program.

fib1.s & fib1.log: the source code and the log file for the Fibonacci test program.

sqrt1.s & sqrt1.log: the source code and the log file for the square root test program.

Task 2

cpu2.vhd: defines the CPU entity with ports to the external memory components.

The body of the entity has been stripped off; it is your task to develop a 5-stage pipelined

version of the CPU.

mem2.vhd: defines the memory entity. Note that the entity has two generic parameters:

filename that defines the name of a binary memory image to load; and n that defines

the number of hold cycles (wait states) that the memory instance inserts at each access.

By setting this parameter larger than zero, you can test that your CPU model responds

correctly to the hold signal.

test2.vhd: defines the test bench. Notice how PORT MAP statements are used to

connect instances of the clock, cpu, and memory entities. Two different configurations

are given. testfib is used to execute the Fibonacci test program and testsqrt is used to

execute the square root test program. Notice how wait states are inserted for data memory

accesses. Notice also that the same binary image is loaded into the instruction memory

and the data memory. The reason for not separating instructions and data is that the test

programs can then also be used to test the system configurations shown in figures 3(b)

and 3(c).

fib2.s & fib2.log: the source code and the log file for the Fibonacci test program. fib2

differs from fib1 by using delayed branches.

sqrt2.s & sqrt2.log: the source code and the log file for the square root test program.

sqrt2 differs from sqrt1 by using delayed branches.

6 Optional tasks

This section specifies a number of optional tasks. You will need to complete at least one

group of tasks (Tasks 6 and 7 belong to one group) in order to acquire a higher grade.

6.1 Task 5: Adding a console I/O peripheral

The Mini-MIPS is quite useless without any I/O communication with the external world.

In this task you need to add an I/O bus interface into the Mini-MIPS, and attach at least one

peripheral module onto the bus, such as the I/O switches and LEDs, RS232 interface, etc.

The console I/O peripheral can be connected to the data memory interface of the Mini-

MIPS, with suitable address decoding for the memory-mapped registers. For example,

you can allocate a range of addresses in the memory space for the peripherals, and use

an I/O console to do the proper address decoding and to manage the operation of the

peripherals. You may also consider adding an interrupt controller to further improve the

usability of the Mini-MIPS. For example, when connecting a timer to the I/O bus, the CPU

needs to be interrupted upon the timer event.

PCSpim is able to simulate the console hardware (Window->Console), so you can test

simple console programs. Four hardware registers are memory-mapped to the address

space 0xFFFF0000 ∼ 0xFFFF000F (16 bytes):

0xFFFF0000 Receiver control register: Bit 0: ready, Bit 1: keyboard interrupt enable.

0xFFFF0004 Receiver data register: Lower 8 bits: last character typed.

0xFFFF0008 Transmitter control register: Bit 0: ready, Bit 1: transmit interrupt enable.

0xFFFF000C Transmitter data register: Lower 8 bits: character to be sent.
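A polled (non-interrupt) console driver for these four registers might look like the C sketch below. It is an illustration only: the names console_regs, console_putc and console_try_getc are hypothetical, and the register block is passed in as a pointer so that the same code can be exercised on a host machine with a fake register block. Only 32-bit accesses are used, since the Mini-MIPS has no byte loads or stores.

```c
#include <stdint.h>

/* The four memory-mapped console registers, in address order.
   On the target this block is assumed to sit at 0xFFFF0000. */
typedef struct {
    volatile uint32_t rx_ctrl;  /* 0xFFFF0000: bit 0 ready, bit 1 interrupt enable */
    volatile uint32_t rx_data;  /* 0xFFFF0004: lower 8 bits, last character typed */
    volatile uint32_t tx_ctrl;  /* 0xFFFF0008: bit 0 ready, bit 1 interrupt enable */
    volatile uint32_t tx_data;  /* 0xFFFF000C: lower 8 bits, character to be sent */
} console_regs;

#define CONSOLE_READY 0x1u

/* Busy-wait until the transmitter is ready, then send one character. */
void console_putc(console_regs *con, uint32_t ch)
{
    while ((con->tx_ctrl & CONSOLE_READY) == 0) {
        /* poll */
    }
    con->tx_data = ch & 0xFFu;
}

/* Non-blocking read: returns 1 and stores the character if one is
   available, otherwise returns 0 and leaves *ch untouched. */
int console_try_getc(console_regs *con, uint32_t *ch)
{
    if ((con->rx_ctrl & CONSOLE_READY) == 0)
        return 0;
    *ch = con->rx_data & 0xFFu;
    return 1;
}
```

On the Mini-MIPS itself the pointer would be obtained with a cast such as (console_regs *)0xFFFF0000; on a host machine a local console_regs variable can stand in for the hardware.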

6.2 Task 6: Adding a memory interface

Design a bus interface and implement the system structure shown in Fig. 2(b) that uses

only one memory module. This may be considered as the Von Neumann model. Create an

entity called minimips3 that connects the CPU from task 2 to the bus interface entity. Hint:

You’ll need to be able to stall the CPU, as memory in this set-up holds both instructions

and data.
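The need to stall follows from the fact that a single memory can serve only one access per cycle. One possible arbitration policy, sketched in C below, gives the data access of a lw or sw priority and makes the instruction fetch wait. The policy itself, the type bus_grant and the function arbitrate are assumptions made for illustration; the task does not prescribe them.

```c
/* Result of one arbitration decision (hypothetical). */
typedef struct {
    int grant_data;   /* memory serves the load/store in the MEM-stage */
    int grant_instr;  /* memory serves the instruction fetch */
    int stall_fetch;  /* IF must keep its PC and retry in the next cycle */
} bus_grant;

/* data_request is nonzero when the MEM-stage holds a lw or sw. */
bus_grant arbitrate(int data_request)
{
    bus_grant g;
    g.grant_data  = data_request ? 1 : 0;
    g.grant_instr = data_request ? 0 : 1;
    g.stall_fetch = g.grant_data;   /* fetch loses the cycle to the data access */
    return g;
}
```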

6.3 Task 7: Adding an instruction and/or data cache

Add an instruction or data cache between the CPU and memory bus interface as illustrated

in Fig. 2(c). Start your design with simple cache configurations, such as a direct-mapped

write-through cache, and gradually increase the hardware complexity.
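As a starting point, the behaviour of a direct-mapped write-through cache can be captured in a few lines of C. The sketch below makes several assumptions that are not part of the task text: 16 lines of one 32-bit word each, no allocation on a write miss, and a backing memory modelled as a word array that starts at address 0. All names are hypothetical.

```c
#include <stdint.h>

#define CACHE_LINES 16u   /* assumed size: 16 lines, one word per line */

typedef struct {
    int      valid[CACHE_LINES];
    uint32_t tag[CACHE_LINES];
    uint32_t data[CACHE_LINES];
    uint32_t hits;
    uint32_t misses;
} cache;

/* Split a byte address into line index and tag (word-aligned accesses). */
uint32_t line_index(uint32_t addr) { return (addr >> 2) % CACHE_LINES; }
uint32_t line_tag(uint32_t addr)   { return (addr >> 2) / CACHE_LINES; }

/* Read one word; on a miss the line is filled from memory. */
uint32_t cache_read(cache *c, const uint32_t *mem, uint32_t addr)
{
    uint32_t i = line_index(addr);

    if (c->valid[i] && c->tag[i] == line_tag(addr)) {
        c->hits++;
        return c->data[i];
    }
    c->misses++;
    c->valid[i] = 1;
    c->tag[i]   = line_tag(addr);
    c->data[i]  = mem[addr >> 2];
    return c->data[i];
}

/* Write one word: memory is always updated (write-through); the cached
   copy is updated only if the line is already present. */
void cache_write(cache *c, uint32_t *mem, uint32_t addr, uint32_t value)
{
    uint32_t i = line_index(addr);

    mem[addr >> 2] = value;
    if (c->valid[i] && c->tag[i] == line_tag(addr))
        c->data[i] = value;
}
```

Two addresses that are 64 bytes apart map to the same line in this configuration and evict each other, which is the behaviour to look for when a test program is used to check conflict misses.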

6.4 Task 8: Using the MipsIt GCC C Compiler

MipsIt is a simplified GCC C cross-compiler for MIPS, with a Windows interface. By

extending the instruction set to include the instructions listed in Table 2 and Table 3, it

is possible to compile C-programs and execute them on the Mini-MIPS. The following

describes how to generate an assembly language file for SPIM. MipsIt software can be

downloaded from the course homepage.

Locate and run “MipsIt.exe” in the bin directory of your MipsIt installation. To

begin, select “File->New” in the menu and create a new “C(minimal)/Assembler” Project.

Create a new C file (or add an existing one). When the C file is opened, select “Build->View

Assembler” to compile the C program into readable MIPS assembly code. Save the assembler

code in a .s file and load it in SPIM. Note that GCC does not know about delayed

load and delayed branches, so to run the code in SPIM, “delayed load” and “delayed

branches” must be disabled. To use the GCC output with the pipelined Task 2 Mini-

MIPS, you must manually modify the assembly code as described above, e.g. reschedule

your code or insert nop instructions to fill the delay slots. Other guidelines for using

GCC with Mini-MIPS are listed below:

• No C libraries are available for Mini-MIPS, so you’ll have to do without <stdio.h>

and so on.

• Since the Mini-MIPS does not include instructions for loading and storing bytes

and half-words, the char and short data types cannot be used, just use the int type

for everything.

• Mini-MIPS does not include the div instructions, so the C operators ‘/’ and ‘%’

should only be used with constants.

• Q: What is _main()? It doesn’t exist!

A: The MipsIt C compiler will insert a call to _main() to do some initialization

before running your program. Easy fix: define it as an empty function:

void _main(void) {}

• Q: Why does my program (e.g. sqrt1.s) read from the stack?

A: PCSpim inserts some C startup code before your program; just let it run and your

program will eventually start at address 0x00400024. To disable this, change .text

in “sqrt1.s” to .text 0x00400000 and the startup code will not be there.
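Because ‘/’ and ‘%’ cannot be used with variable operands, a program that needs a general division has to compute it in software. The following C sketch uses the classic shift-and-subtract method and only operations that exist in the extended instruction set (shifts, subtraction, comparison). The function name udivmod is hypothetical, unsigned int is assumed to be 32 bits wide, and a zero divisor is not checked.

```c
/* Unsigned shift-and-subtract division (hypothetical helper).
   Returns the quotient n / d and stores the remainder n % d in *rem.
   Assumes a 32-bit unsigned int and d != 0. */
unsigned int udivmod(unsigned int n, unsigned int d, unsigned int *rem)
{
    unsigned int q = 0;
    unsigned int r = 0;
    int i;

    for (i = 31; i >= 0; i--) {
        r = (r << 1) | ((n >> i) & 1u);   /* bring down the next bit of n */
        if (r >= d) {                     /* divisor fits: subtract, set quotient bit */
            r -= d;
            q |= 1u << i;
        }
    }
    *rem = r;
    return q;
}
```

The loop always runs 32 iterations, so the cost is fixed; this is slow compared with a hardware divider but needs no additional instructions.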

Table 2: Extended Mini-MIPS instruction set supporting the MipsIt GCC C compiler.
Extended instructions are shown in bold. α = 4 for the pipelined processor. ‘&’ indicates
bit-string concatenation; ‘s()’ represents signed extension; ‘us()’ represents unsigned
extension. Imm occupies bits 15-0.

Inst.  31-26  25-21  20-16  15-11  10-6   5-0    Semantics

Arithmetic
addu   X“00”  R[s]   R[t]   R[d]   X“00”  X“21”  R[d] = R[s] + R[t]
addiu  X“09”  R[s]   R[t]   Imm                  R[t] = R[s] + s(Imm)
subu   X“00”  R[s]   R[t]   R[d]   X“00”  X“23”  R[d] = R[s] - R[t]
multu  X“00”  R[s]   R[t]   X“00”  X“00”  X“19”  LO = ((R[s] * R[t]) ≪ 32) ≫ 32
                                                 HI = (R[s] * R[t]) ≫ 32

Logical
and    X“00”  R[s]   R[t]   R[d]   X“00”  X“24”  R[d] = R[s] AND R[t]
or     X“00”  R[s]   R[t]   R[d]   X“00”  X“25”  R[d] = R[s] OR R[t]
xor    X“00”  R[s]   R[t]   R[d]   X“00”  X“26”  R[d] = R[s] XOR R[t]
sll    X“00”  X“00”  R[t]   R[d]   Shamt  X“00”  R[d] = R[t] ≪ Shamt (logical)
srl    X“00”  X“00”  R[t]   R[d]   Shamt  X“02”  R[d] = R[t] ≫ Shamt (logical)
sra    X“00”  X“00”  R[t]   R[d]   Shamt  X“03”  R[d] = R[t] ≫ Shamt (arithmetic)
slt    X“00”  R[s]   R[t]   R[d]   X“00”  X“2A”  R[d] = if (R[s] < R[t]) then 1D else 0D (signed)
sltu   X“00”  R[s]   R[t]   R[d]   X“00”  X“2B”  R[d] = if (R[s] < R[t]) then 1D else 0D (unsigned)
nor    X“00”  R[s]   R[t]   R[d]   X“00”  X“27”  R[d] = R[s] NOR R[t]
andi   X“0C”  R[s]   R[t]   Imm                  R[t] = R[s] AND us(Imm)
ori    X“0D”  R[s]   R[t]   Imm                  R[t] = R[s] OR us(Imm)
xori   X“0E”  R[s]   R[t]   Imm                  R[t] = R[s] XOR us(Imm)
sllv   X“00”  R[s]   R[t]   R[d]   X“00”  X“04”  R[d] = R[t] ≪ R[s][4:0] (logical)
srlv   X“00”  R[s]   R[t]   R[d]   X“00”  X“06”  R[d] = R[t] ≫ R[s][4:0] (logical)
srav   X“00”  R[s]   R[t]   R[d]   X“00”  X“07”  R[d] = R[t] ≫ R[s][4:0] (arithmetic)
slti   X“0A”  R[s]   R[t]   Imm                  R[t] = if (R[s] < s(Imm)) then 1D else 0D (signed)
sltiu  X“0B”  R[s]   R[t]   Imm                  R[t] = if (R[s] < s(Imm)) then 1D else 0D (unsigned)
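A few rows of Table 2 are easy to get subtly wrong, in particular the HI/LO split of multu and the difference between signed and unsigned comparison. The C sketch below restates those semantics in executable form as a reference to compare a VHDL model against. The function names mirror the mnemonics but are otherwise hypothetical, and operands are passed as plain 32-bit values rather than register numbers.

```c
#include <stdint.h>

/* multu: 32 x 32 -> 64-bit unsigned product, split into HI and LO. */
void multu(uint32_t rs, uint32_t rt, uint32_t *hi, uint32_t *lo)
{
    uint64_t product = (uint64_t)rs * (uint64_t)rt;
    *lo = (uint32_t)(product & 0xFFFFFFFFu);
    *hi = (uint32_t)(product >> 32);
}

/* s(Imm): sign-extend a 16-bit immediate to 32 bits. */
uint32_t sign_extend16(uint16_t imm)
{
    return (imm & 0x8000u) ? (0xFFFF0000u | imm) : imm;
}

/* sltu: unsigned comparison of the raw bit patterns. */
uint32_t sltu(uint32_t rs, uint32_t rt)
{
    return rs < rt ? 1u : 0u;
}

/* slt: signed comparison. Flipping the sign bit of both operands maps
   two's complement order onto unsigned order. */
uint32_t slt(uint32_t rs, uint32_t rt)
{
    return (rs ^ 0x80000000u) < (rt ^ 0x80000000u) ? 1u : 0u;
}

/* slti and sltiu both sign-extend the immediate; only the comparison differs. */
uint32_t slti(uint32_t rs, uint16_t imm)  { return slt(rs, sign_extend16(imm)); }
uint32_t sltiu(uint32_t rs, uint16_t imm) { return sltu(rs, sign_extend16(imm)); }
```

Note the consequence for sltiu: an immediate of X“FFFF” is extended to X“FFFFFFFF” and then treated as a very large unsigned number, so the comparison is true for almost every value of R[s].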

Table 3: Extended Mini-MIPS instruction set supporting the MipsIt GCC C compiler,
continued. Extended instructions are shown in bold. α = 4 for the pipelined processor.
‘&’ indicates bit-string concatenation; ‘s()’ represents signed extension; ‘us()’ represents
unsigned extension. Imm and Offset occupy bits 15-0; Target occupies bits 25-0.

Inst.  31-26  25-21  20-16  15-11  10-6   5-0    Semantics

Data Transfer
mfhi   X“00”  X“00”  X“00”  R[d]   X“00”  X“10”  R[d] = HI
mflo   X“00”  X“00”  X“00”  R[d]   X“00”  X“12”  R[d] = LO
lui    X“0F”  X“00”  R[t]   Imm                  R[t] = Imm & X“0000”
lw     X“23”  R[s]   R[t]   Offset               R[t] = Mem[R[s] + s(Offset)]
sw     X“2B”  R[s]   R[t]   Offset               Mem[R[s] + s(Offset)] = R[t]

Unconditional jump
j      X“02”  Target                             PC = (PC + α)[31:28] & Target[25:0] & “00”
jal    X“03”  Target                             R[31] = PC + 4 + α
                                                 PC = (PC + α)[31:28] & Target[25:0] & “00”
jr     X“00”  R[s]   X“00”  X“00”  X“00”  X“08”  PC = R[s]
jalr   X“00”  R[s]   X“00”  R[d]   X“00”  X“09”  R[d] = PC + 4 + α (R[d] is usually R[31])
                                                 PC = R[s] (R[s] and R[d] must be different)

Conditional branch
beq    X“04”  R[s]   R[t]   Offset               PC = if (R[s] == R[t]) then (PC + α + (s(Offset) ≪ 2)) else (PC + 4)
bne    X“05”  R[s]   R[t]   Offset               PC = if (R[s] ≠ R[t]) then (PC + α + (s(Offset) ≪ 2)) else (PC + 4)
bltz   X“01”  R[s]   X“00”  Offset               PC = if (R[s] < 0) then (PC + α + (s(Offset) ≪ 2)) else (PC + 4)
bgez   X“01”  R[s]   X“01”  Offset               PC = if (R[s] >= 0) then (PC + α + (s(Offset) ≪ 2)) else (PC + 4)
blez   X“06”  R[s]   X“00”  Offset               PC = if (R[s] <= 0) then (PC + α + (s(Offset) ≪ 2)) else (PC + 4)
bgtz   X“07”  R[s]   X“00”  Offset               PC = if (R[s] > 0) then (PC + α + (s(Offset) ≪ 2)) else (PC + 4)
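The target address formulas in Table 3 can be checked with a small C sketch. α is passed in as a parameter (alpha), since the table only fixes its value for the pipelined processor; the function names are hypothetical.

```c
#include <stdint.h>

/* j / jal: PC = (PC + alpha)[31:28] & Target[25:0] & "00" */
uint32_t jump_target(uint32_t pc, uint32_t target26, uint32_t alpha)
{
    return ((pc + alpha) & 0xF0000000u) | ((target26 & 0x03FFFFFFu) << 2);
}

/* Taken branch: PC = PC + alpha + (s(Offset) << 2) */
uint32_t branch_target(uint32_t pc, uint16_t offset, uint32_t alpha)
{
    uint32_t ext = (offset & 0x8000u) ? (0xFFFF0000u | offset) : offset;
    return pc + alpha + (ext << 2);   /* unsigned wrap-around gives backward branches */
}

/* jal / jalr: the return address written to the link register. */
uint32_t link_address(uint32_t pc, uint32_t alpha)
{
    return pc + 4 + alpha;
}
```

With alpha = 4, a branch at 0x00400010 with Offset X“FFFF” (that is, -1) targets 0x00400010 again, and a jal at the same address stores 0x00400018 as the return address, which skips both the jal and its delay slot.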
