IDT 79R3081 Microprocessor

Stephen Fu

SID# 069-78-0229

May 1st, 2002

Prof. Robert Dewar

Overview

Integrated Device Technology, Inc. (IDT) delivers advanced communications integrated circuit solutions that enhance network performance, bandwidth and quality of service to accelerate time to market for leading communications companies.

The IDT R30xx family is a series of high-performance 32-bit microprocessors that feature a high level of integration and are targeted at high-performance but cost-sensitive processing applications.

The R30xx family inherits the high performance of the MIPS RISC architecture and brings it to low-cost, power-sensitive applications.

The 79R3081 extends the capabilities of the R30xx family by integrating additional resources into the same pin-out, thus extending the range of applications addressed by the R30xx family.

Features on the R3081

Instruction set is compatible with earlier IDT79R3000A, R3041, R3051, and R3071 RISC CPUs.

Can execute over 40 MIPS at 50 MHz without requiring external SRAM or caches

An on-chip Floating Point Accelerator (FPA)

Large on-chip caches that are user configurable:

16kB Instruction Cache, 4kB Data Cache

Dynamically configurable to 8kB Instruction Cache and 8kB Data Cache

Hardware-based cache coherency support

On-chip 4-deep write buffer eliminates memory write stalls

On-chip 4-deep read buffer supports burst or simple block reads

Features (cont.)

On-chip DMA arbiter

On-chip clock doubler to provide higher frequency signals to the internal execution core

Uses a five-stage pipeline

Uses a fixed segment-based memory mapping scheme that omits the TLB

Has an extended version, the R3081E, which incorporates a full-function memory management unit (MMU) including a 64-entry fully associative Translation Lookaside Buffer (TLB)

Flexible and multiplexed bus interface with support for low-cost, low-speed memory systems with a high-speed CPU

A full 32-bit RISC integer execution engine, capable of sustaining close to single-cycle execution

Register Set

R3081 contains thirty-two 32-bit integer registers, a 32-bit Program Counter, and two dedicated 32-bit registers which hold the result of an integer multiply or divide operation.

The 32 general registers are treated symmetrically, with two exceptions:

Register r0 is hardwired to a zero value. When used as a source register, it allows different addressing modes, no-ops, and register or memory clear operations without requiring expansion of the basic instruction set.

Register r31 is used as the link register in jump and link instructions. During subroutine calls, the return address is placed in this register.

The two dedicated registers (HI and LO) store the double-word, 64-bit result of integer multiply operations, and the quotient and remainder of integer divide operations.
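As a rough illustration of how HI and LO are filled, the sketch below models signed multiply and divide on 32-bit values. The function names are hypothetical, and the rounding of signed division (truncation toward zero) is an assumption about the usual MIPS behaviour, not something the slides state.

```python
MASK32 = 0xFFFFFFFF


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def mult(rs, rt):
    """Signed multiply: return (HI, LO), the upper and lower halves of the 64-bit product."""
    product = (to_signed32(rs) * to_signed32(rt)) & 0xFFFFFFFFFFFFFFFF
    return (product >> 32) & MASK32, product & MASK32


def div(rs, rt):
    """Signed divide: return (HI, LO) = (remainder, quotient).

    Assumes rt is non-zero and that the quotient truncates toward zero.
    """
    a, b = to_signed32(rs), to_signed32(rt)
    quotient = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        quotient = -quotient
    remainder = a - quotient * b
    return remainder & MASK32, quotient & MASK32
```

For example, `mult(0x10000, 0x10000)` leaves 1 in HI and 0 in LO, because the product 2^32 does not fit in the low word.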

Special Co-Processor Registers

In addition to the general CPU registers, the R3081 contains a number of special registers on-chip.

Some registers logically reside in the on-chip System Control Co-processor (CP0), and are used in memory management and exception handling.

The Floating Point Accelerator also resides on-chip and operates as Co-Processor 1 (CP1).

The use of these registers is covered in later slides.

Instruction Set

All instructions are 32 bits long, and there are only three basic instruction formats:

I-Type

J-Type

R-Type

This approach dramatically simplifies instruction decoding, permitting higher frequency operation.

More complicated, but less frequently used, operations and addressing modes are synthesized by the compilers, using sequences of the basic instruction set.

The instruction set can be divided into three basic groups:

Load / Store

Computational

Jump and Branch
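To show why three fixed formats keep decoding simple, here is a minimal sketch that splits a 32-bit instruction word into its fields. The field positions follow the standard MIPS I encoding. The function name and the dictionary layout are illustrative choices, and co-processor opcodes, which have their own formats, are simply treated as I-Type here.

```python
def decode(word):
    """Split a 32-bit MIPS I instruction word into its format-specific fields."""
    opcode = (word >> 26) & 0x3F
    if opcode == 0:
        # R-Type: two source registers, a destination, a shift amount, a function code
        return {
            "format": "R",
            "opcode": opcode,
            "rs": (word >> 21) & 0x1F,
            "rt": (word >> 16) & 0x1F,
            "rd": (word >> 11) & 0x1F,
            "shamt": (word >> 6) & 0x1F,
            "funct": word & 0x3F,
        }
    if opcode in (2, 3):
        # J-Type (J and JAL): a 26-bit word-aligned target
        return {"format": "J", "opcode": opcode, "target": word & 0x03FFFFFF}
    # I-Type: base/source register, target register, 16-bit immediate
    return {
        "format": "I",
        "opcode": opcode,
        "rs": (word >> 21) & 0x1F,
        "rt": (word >> 16) & 0x1F,
        "imm": word & 0xFFFF,
    }
```

For instance, `decode(0x00221820)` yields the R-Type fields of `ADD r3, r1, r2` (rs=1, rt=2, rd=3, funct=0x20).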

Load and Store Instructions

Load and store instructions move data between memory and general registers.

They are all encoded as “I-Type” instructions.

The only addressing mode implemented is base register plus 16-bit signed immediate offset.

Using register r0 as the base implements immediate (absolute) addressing directly.

All load operations have a latency of one instruction: the loaded data is not available to the instruction that immediately follows the load.

An exception is the target register of the “load word left” and “load word right” instructions, which may be the same register used as the destination of the load instruction that immediately precedes it.

The instruction opcode determines the size of the data item to be loaded or stored. For example, LB is “load byte” and SW is “store word”.
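The base-plus-offset calculation can be sketched in a few lines. The helper names are hypothetical. The point is only that the 16-bit offset is sign-extended before it is added, so a load or store can reach 32 KB on either side of the base register.

```python
def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def effective_address(base, offset16):
    """Effective address of a load/store: base register plus signed 16-bit offset, modulo 2**32."""
    return (base + sign_extend16(offset16)) & 0xFFFFFFFF
```

With r0 as the base, `effective_address(0, 0x7FFF)` addresses the low 32 KB of memory directly, while `effective_address(0x1000, 0xFFFC)` steps back four bytes to 0x0FFC.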

Computational Instructions

Computational instructions perform arithmetic, logical and shift operations on values in registers.

They occur in both “R-Type” (when both source operands are general registers) and “I-Type” (when one of the source operands is a 16-bit immediate value) formats.

They use a three-address format, so that operations do not overwrite the contents of source registers.

Examples:

ADD rd, rs, rt

Adds the contents of registers rs and rt and places the 32-bit result in register rd.

SLL rd, rt, shamt

Shifts the contents of register rt left by shamt bits, inserting zeroes into the low-order bits, and places the 32-bit result in register rd.
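The two examples can be modelled roughly as follows. This is a sketch, not a cycle-accurate model: the `OverflowTrap` class is a hypothetical stand-in for the overflow exception that ADD raises on two's-complement overflow.

```python
MASK32 = 0xFFFFFFFF


class OverflowTrap(Exception):
    """Hypothetical stand-in for the processor's overflow exception."""


def to_signed32(value):
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def add(rs, rt):
    """ADD rd, rs, rt: signed 32-bit add that traps on two's-complement overflow."""
    result = to_signed32(rs) + to_signed32(rt)
    if not -(1 << 31) <= result < (1 << 31):
        raise OverflowTrap("signed result does not fit in 32 bits")
    return result & MASK32


def sll(rt, shamt):
    """SLL rd, rt, shamt: shift left by a 5-bit amount, filling low-order bits with zeroes."""
    return (rt << (shamt & 0x1F)) & MASK32
```

Both functions return a new value rather than modifying their inputs, mirroring the three-address format in which the source registers are left unchanged.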

Jump and Branch Instructions

Jump and branch instructions change the control flow of a program.

Jump instructions can be encoded in “J-Type” format:

The 26-bit target address is shifted left two bits, and combined with the high-order 4 bits of the current program counter to form a 32-bit absolute address.

This form is used for subroutine calls.

Jumps can also be encoded in “R-Type” format:

The target address is a 32-bit value contained in one of the general registers.

This form is used for returns and dispatches.

Branch instructions are encoded in “I-Type” format:

The target address is formed from a 16-bit offset relative to the program counter.
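The target calculations for the J-Type and I-Type forms can be sketched as below. One detail is an assumption taken from general MIPS I behaviour rather than from the slides: the "current program counter" is taken to be the address of the delay-slot instruction (the jump or branch address plus 4), and the branch offset is counted in words.

```python
MASK32 = 0xFFFFFFFF


def sign_extend16(imm):
    """Sign-extend a 16-bit immediate to a Python integer."""
    imm &= 0xFFFF
    return imm - 0x10000 if imm & 0x8000 else imm


def jump_target(pc, target26):
    """J-Type target: top 4 bits of the delay-slot address joined with target << 2."""
    delay_slot = (pc + 4) & MASK32
    return (delay_slot & 0xF0000000) | ((target26 & 0x03FFFFFF) << 2)


def branch_target(pc, offset16):
    """I-Type branch target: delay-slot address plus the signed word offset."""
    delay_slot = (pc + 4) & MASK32
    return (delay_slot + (sign_extend16(offset16) << 2)) & MASK32
```

Because only 28 bits come from the instruction, a J-Type jump stays within the current 256 MB region; reaching further requires the register form.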

Addressing Modes

The R3081 provides very simple addressing modes.

The only addressing mode implemented is base register plus signed, immediate offset.

This enables the use of three distinct addressing modes:

Register plus offset

Register direct

Immediate

The bytes within the addressed word that are used can be determined directly from the access size and the two low-order bits of the address.

The endianness of a given access is dynamic: the RE (Reverse Endianness) bit of the Status Register in CP0 can be used to force user-space accesses to use the opposite byte convention from the kernel.

Cache Architecture

To maximize performance, the R3081 implements a Harvard Architecture caching strategy, which separates the cache into two parts:

The Instruction Cache contains instructions, or operations.

The Data Cache contains data, or operands.

Each main memory address can be mapped to only one particular cache location.

The address presented to the cache and cache controller is that of the physical (main) memory element to be accessed.

The operation of the on-chip caches is automatically handled by the processor.

In case of a cache miss, the processor will enter stall cycles until the bus interface unit indicates that it has obtained the necessary data.

Instruction Cache

The R3081 implements a 16KB instruction cache by default. It can be reduced to 8KB, in which case the data cache is increased to 8KB.

The cache is organized with a line size of 16 bytes, so each line holds four adjacent words from main memory.

This cache can achieve hit rates of over 98% in most applications.

It is also capable of caching instructions from anywhere within the 4GB physical address space.

It is implemented using physical addresses and does not require flushing the cache on context switch.

Data Cache

The R3081 incorporates a default data cache of 4KB, which can be reconfigured to 8KB.

The cache is organized with a line size of 4 bytes, or one word.

This relatively large data cache achieves hit rates of over 95% in most applications.

The data cache is also implemented as a direct-mapped physical address cache that is capable of mapping to any word within the 4GB address space.

The data cache is implemented as a write-through cache, to ensure that main memory is always consistent with the internal cache.
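Direct mapping means the cache location follows from the physical address by simple arithmetic. The sketch below shows that arithmetic for the cache and line sizes quoted above. The function name and the split into index, tag, and offset are a generic textbook model, not the documented tag layout of the R3081.

```python
def cache_slot(paddr, cache_bytes, line_bytes):
    """Locate a physical address in a direct-mapped cache.

    Returns the line index, the tag stored alongside the line, and the byte offset in the line.
    """
    lines = cache_bytes // line_bytes
    line_number = paddr // line_bytes
    return {
        "index": line_number % lines,
        "tag": line_number // lines,
        "offset": paddr % line_bytes,
    }


# Default configuration from the slides: 16KB I-cache with 16-byte lines,
# 4KB D-cache with 4-byte lines.
icache = cache_slot(0x00004010, 16 * 1024, 16)
dcache = cache_slot(0x00001004, 4 * 1024, 4)
```

Two addresses that are exactly one cache size apart share an index and differ only in tag, so they evict each other; that is the cost of the single-location mapping.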

Cache Coherency

The R3081 provides support for hardware based cache coherency.

The cache coherency mechanisms were designed mainly to support cache coherency in a DMA environment.

The R3081 has a DMA arbiter to coordinate the external requests for mastership with the CPU read and write interface:

Non-coherent DMA requests have the highest priority and are guaranteed to gain mastership at the next arbitration.

For coherent DMA requests, the read buffer must be emptied into the caches and the write buffer written to main memory before the bus is granted, to ensure memory coherency.

During DMA writes, the processor can be directed to invalidate the cache lines that correspond to the addresses being written, since they may be inconsistent. Later accesses then fetch the current value from main memory.

Pipeline Architecture

The R3081 utilizes a 5-stage pipeline design to achieve an execution rate approaching one instruction per cycle.

The five stages are:

Instruction Fetch (IF)

In this stage, the instruction virtual address is translated to a physical address and the instruction is read from the Instruction Cache.

Read (RD)

During this stage, the instruction is decoded and required operands are read from the register file.

ALU

The required operation is performed on the instruction operands.

Memory Access (MEM)

The Data Cache is accessed if the instruction is a load or store.

Write Back (WB)

The results obtained from the ALU stage operation are written to the register file.

The pipeline operates efficiently because different CPU resources such as address and data bus access, ALU operations, and the register file are all utilized on a non-interfering basis.
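A small sketch can make the "approaching one instruction per cycle" claim concrete. It assumes an ideal pipeline with no stalls, and the function names are illustrative: each instruction enters IF one cycle after its predecessor, so n instructions finish in n + 4 cycles.

```python
STAGES = ["IF", "RD", "ALU", "MEM", "WB"]


def pipeline_schedule(n):
    """Return {cycle: [(instruction, stage), ...]} for n stall-free instructions."""
    schedule = {}
    for instruction in range(n):
        for depth, stage in enumerate(STAGES):
            schedule.setdefault(instruction + depth, []).append((instruction, stage))
    return schedule


def total_cycles(n):
    """Cycles needed to retire n instructions when the pipeline never stalls."""
    return n + len(STAGES) - 1 if n else 0
```

In cycle 2 of a three-instruction run, instruction 0 is in ALU, instruction 1 in RD, and instruction 2 in IF, so three different resources are busy at once. For 100 instructions the total is 104 cycles, or about 1.04 cycles per instruction.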

Pipeline Hazards

Pipeline hazards occur when the current pipestage of an instruction requires the result of a previous instruction whose result is not yet available and still resides in the pipeline.

To handle this, a logical unit within the execution engine of the R3081 forwards the result of instruction n’s ALU operation to instruction n+1, prior to the true writeback operation.

Pipeline hazards can also be handled in hardware during integer multiply and divide operations:

If an instruction attempts to access the HI or LO registers before the multiply or divide completes, that instruction is interlocked until the operation completes.

Pipeline Hazards (cont.)

Two categories of instructions utilize software intervention to ensure logical operation:

Load instructions have a delay, or latency, of one cycle before data loaded from memory is available for another instruction.

Jump and Branch instructions also have a delay of one cycle before the program flow change can occur.

In these cases, the CPU continues execution, despite the delay in the operation.

The CPU gives responsibility for dealing with “delay slots” to software.

Software can place an instruction that does not require the result of the delayed instruction into the delay slot. In the worst case, a NOP can be inserted.
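The worst-case strategy, inserting a NOP, can be sketched as a tiny pass over an instruction list. Everything here is illustrative: the tuple representation `(op, dest, sources)` is a hypothetical format, only load delay slots are handled, and the special case for "load word left" and "load word right" is ignored.

```python
LOADS = {"lb", "lbu", "lh", "lhu", "lw"}
NOP = ("nop", None, ())


def fill_load_delay_slots(program):
    """Insert a NOP after any load whose result is used by the very next instruction.

    Each instruction is a tuple (op, dest, sources).
    """
    scheduled = []
    for index, (op, dest, sources) in enumerate(program):
        scheduled.append((op, dest, sources))
        if op in LOADS and index + 1 < len(program):
            next_sources = program[index + 1][2]
            if dest in next_sources:
                scheduled.append(NOP)
    return scheduled


program = [
    ("lw", "r2", ("r4",)),        # r2 is loaded from memory
    ("add", "r3", ("r2", "r5")),  # uses r2 immediately: needs a delay slot
    ("lw", "r6", ("r4",)),
    ("add", "r7", ("r1", "r1")),  # independent of r6: no NOP needed
]
scheduled = fill_load_delay_slots(program)
```

A real compiler would first try to move an independent instruction into the slot and fall back to the NOP only when none is available.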

System Control Co-Processor (CP0)

The System Control Co-Processor contains registers to manipulate the memory management and exception handling facilities of the processor.

Co-Processor Loads and Stores are all encoded as “I-Type” instructions.

Co-Processor computational instructions have co-processor dependent formats.

Memory Management

There are two unique privilege states:

Kernel mode - four distinct virtual address segments:

Kuseg - the kernel sees the same virtual addresses as a user process and can therefore directly access user memory regions.

Kseg0 - a 512 MB segment, beginning at virtual address 0x8000_0000. Typically used for kernel executable code and some kernel data.

Kseg1 - a 512 MB segment, beginning at virtual address 0xa000_0000. Typically used for I/O registers, boot ROM code, and operating system data.

Kseg2 - this is the same as segment kuseg, but is accessible only from kernel mode. It contains 1 GB of linear addresses, beginning at virtual address 0xc000_0000.

User mode - a single, uniform virtual address space of 2 GB:

Kuseg begins at virtual address 0 and extends linearly for 2 GB. This segment is used to hold user code and data, and the current user processes.

For either state, the virtual to physical address translation depends on whether the processor is a base or extended architecture version.
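The segment boundaries listed above reduce to a few comparisons on the virtual address. The sketch below classifies an address and checks whether it is reachable from the current mode. The function names are hypothetical.

```python
def segment(vaddr):
    """Name the virtual address segment that contains a 32-bit virtual address."""
    vaddr &= 0xFFFFFFFF
    if vaddr < 0x80000000:
        return "kuseg"  # 2 GB, available in user and kernel mode
    if vaddr < 0xA0000000:
        return "kseg0"  # 512 MB starting at 0x8000_0000
    if vaddr < 0xC0000000:
        return "kseg1"  # 512 MB starting at 0xa000_0000
    return "kseg2"      # 1 GB starting at 0xc000_0000


def accessible(vaddr, kernel_mode):
    """User mode may only touch kuseg; kernel mode may touch every segment."""
    return kernel_mode or segment(vaddr) == "kuseg"
```

For example, `segment(0xA0001000)` is `"kseg1"`, and `accessible(0xA0001000, kernel_mode=False)` is `False`.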

Base Version R3081

The base version R3081 provides segment-based virtual to physical address translation without requiring extensive virtual page management.

Kuseg is always translated to a contiguous 2 GB region of the physical address space.

Virtual addresses in kseg2 are directly mapped to physical addresses, unchanged.

Reserved spaces are available for compatibility with future family members.

The distinction between user tasks and kernel tasks can be implemented by decoding the output physical address.

Extended Version R3081E

The extended version R3081E provides a full featured MMU which uses an on-chip TLB and dedicated registers in CP0 to provide software management of page tables.

The MMU maps 4 KB virtual pages to 4 KB physical pages.

Kuseg and kseg2 may be mapped anywhere in the 4 GB physical address space.

The System Control Co-Processor (CP0) contains the TLB and other registers to perform address translation.

The TLB is a fully associative memory that holds 64 entries and provides a mapping of 64 4 KB pages.

Pages are mapped by substituting a 20-bit physical frame number for a 20-bit virtual page number.
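The substitution of frame number for page number can be sketched with a dictionary standing in for the TLB. This is a simplified model under stated assumptions: real TLB entries carry additional fields (such as validity and process identification) that are omitted here, and the `TLBMiss` class and function name are hypothetical.

```python
PAGE_SHIFT = 12           # 4 KB pages leave a 12-bit offset within the page
PAGE_MASK = (1 << PAGE_SHIFT) - 1
TLB_ENTRIES = 64


class TLBMiss(Exception):
    """Hypothetical stand-in for the TLB miss exception."""


def translate(tlb, vaddr):
    """Replace the 20-bit virtual page number with the mapped 20-bit physical frame number.

    tlb is a dict {virtual page number: physical frame number} with at most 64 entries.
    """
    if len(tlb) > TLB_ENTRIES:
        raise ValueError("the TLB holds at most 64 entries")
    vpn = (vaddr & 0xFFFFFFFF) >> PAGE_SHIFT
    if vpn not in tlb:
        raise TLBMiss(hex(vaddr))
    return (tlb[vpn] << PAGE_SHIFT) | (vaddr & PAGE_MASK)


tlb = {0x00400: 0x12345}
paddr = translate(tlb, 0x00400ABC)
```

With 64 entries of 4 KB each, the TLB covers 256 KB of virtual address space at a time; anything outside that raises a miss for software to handle.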

Exception Handling

The exception handling capabilities of the R3081 are provided to ensure an orderly transfer of control from an executing program to the kernel.

It handles exceptions such as TLB misses, arithmetic overflows (integer or floating point), I/O interrupts, system calls, breakpoints, etc.

Whenever an exception occurs, the processor aborts the instruction causing the exception, as well as instructions following in the exception pipeline which have already begun, and then performs a direct jump into the exception handler routine.

The System Control Co-Processor (CP0) contains registers which are examined during exception processing.

Exception Handling Registers

Cause Register:

A 5-bit exception code indicates the cause of the current exception.

The remaining fields contain detailed information specific to certain exceptions.

EPC (Exception Program Counter) Register:

Contains the virtual address of the instruction which took the exception

BadVAddr Register:

Saves the entire bad virtual address for any addressing exception and provides information useful for a software TLB exception handler

Status Register:

Contains a three level stack (current, previous, old) of the kernel/user mode bit

The stack is pushed when each exception is taken, and popped by the Restore From Exception instruction.
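The push and pop of the mode stack can be sketched as shifts on the low bits of the Status Register. The bit layout used here is an assumption based on the R3000 architecture rather than on the slides: each level is a pair (kernel/user bit, interrupt enable bit), with the current pair in bits 1:0, previous in bits 3:2, and old in bits 5:4, and a kernel/user bit of 0 meaning kernel mode.

```python
STACK_MASK = 0x3F  # bits 5:0 hold the old, previous, and current pairs


def take_exception(status):
    """Push the stack: current -> previous, previous -> old, new current = kernel mode, interrupts off."""
    pushed = ((status & STACK_MASK) << 2) & STACK_MASK
    return (status & ~STACK_MASK) | pushed


def rfe(status):
    """Pop the stack (Restore From Exception): previous -> current, old -> previous, old unchanged."""
    stack = status & STACK_MASK
    popped = (stack & 0x30) | (stack >> 2)
    return (status & ~STACK_MASK) | popped


user_mode = 0b000011                # current pair: user mode, interrupts enabled
in_handler = take_exception(user_mode)
back_in_user = rfe(in_handler)
```

Three levels let the processor take a second exception, such as a TLB miss inside a handler, before software has had a chance to save the Status Register.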

Handling Specific Exceptions

Address Error Exception:

Occurs when an attempt is made to load, fetch, or store a word that is not aligned on a word boundary

The EPC register points at the instruction that caused the exception.

The BadVAddr register contains the virtual address that was not properly aligned.

The kernel hands the executing process a segmentation violation signal.

Overflow Exception:

Occurs when an ADD, ADDI, or SUB instruction results in two’s complement overflow

The kernel hands the executing process an error which is fatal.

TLB Miss Exception:

Occurs when a kernel/user mode virtual address reference to memory matches an invalid TLB entry

The operating system loads the appropriate Page Table Entry into the TLB.

Floating Point Accelerator (CP1)

The R3081 contains an on-chip Floating Point Accelerator (FPA), which operates as a coprocessor to perform arithmetic operations on values in floating-point representations.

It has full 64-bit operation.

The FPA uses a load/store instruction set, with single-cycle loads and stores.

The FPA connects with the Integer Processor to integrate the floating-point and fixed-point instruction sets.

FPA Register Set

32 32-bit Floating-Point General Registers:

Directly addressable registers used in floating point operations and individually accessible via move, load, and store instructions

16 64-bit Floating-Point Registers:

Logical registers used to store data values during floating-point operations

Formed by concatenating two adjacent floating-point general registers

Hold values in either single or double precision floating-point format

2 Floating-Point Control Registers:

Can be accessed only by Move operations

Used to control and monitor exceptions

FPA Operations

The FPA performs both 32-bit (single-precision) and 64-bit (double-precision) IEEE standard floating-point operations:

32-bit format has a 24-bit signed-magnitude fraction field and an 8-bit exponent.

64-bit format has a 53-bit signed-magnitude fraction field and an 11-bit exponent.

Load, Store, and Move Operations:

These operations move data between memory, the main processor and the FPA general registers.

These operations perform no conversions and cause no floating-point exceptions.

Arithmetic Operations:

3-Operand Register-Type instructions perform floating-point addition, subtraction, multiplication, and division operations.

2-Operand Register-Type instructions perform floating-point absolute value, move, and negate operations.
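The two IEEE formats can be taken apart with Python's standard `struct` module, as a quick way to see the fields named above. The 24-bit and 53-bit figures count the sign bit together with the stored fraction: the single format stores 23 fraction bits and the double format 52, each alongside a separate sign bit. The function names are illustrative.

```python
import struct


def decode_single(x):
    """Split a 32-bit IEEE single into sign (1 bit), exponent (8 bits), fraction (23 stored bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return {
        "sign": bits >> 31,
        "exponent": (bits >> 23) & 0xFF,
        "fraction": bits & 0x7FFFFF,
    }


def decode_double(x):
    """Split a 64-bit IEEE double into sign (1 bit), exponent (11 bits), fraction (52 stored bits)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return {
        "sign": bits >> 63,
        "exponent": (bits >> 52) & 0x7FF,
        "fraction": bits & ((1 << 52) - 1),
    }
```

For example, `decode_single(1.0)` has a biased exponent of 127 and a zero fraction, while `decode_single(-2.5)` has sign 1, exponent 128, and fraction 0x200000 (the stored .25 of 1.25).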

FPA Instruction Pipeline

The FPA provides an instruction pipeline that parallels that of the Integer Processor.

The FPA has a 6-stage pipeline instead of the 5-stage pipeline of the Integer CPU.

The additional FPA pipe stage (FWB) is used to provide efficient coordination between the FPA and main processor:

During the FWB stage, the FPA writes back ALU results to its register file.

To lessen the impact of frequent pipeline stalls, the FPA allows overlapping of instructions so that instruction execution can proceed.

Conclusion

Since IDT introduced its 32-bit microprocessor R3081 in 1992, it has been continuously competing with the best embedded PowerPC or StrongARM processors for better performance and newer designs.

IDT is starting to redesign its R3000 processor core. The new core is an IDT-only derivative that first appears in the company's new RC32364 chip, sampling now.

IDT also added a multiply-accumulate instruction (MAC), which is not part of any official MIPS instruction set:

It multiplies a pair of 32-bit floating-point values and adds the result to another floating-point value in a single operation.

IDT is still using the 32-bit processor core, but provides many of the functions that have only been available on a 64-bit RISC processor.

The R3081 is being used on the TC-702-integrated Internet TV microprocessor.

References

http://www.idt.com/products/pages/Standalone_Processors-79R3081.html

http://www.fulcrum.ru/Documents/CDROMs/IDT/docs/rp00001/rp0019c.htm

http://www.mdronline.com/publications/epw/issues/epw3.html

http://www.eetimes.com/news/98/1012news/idt.html

http://www.windriver.com/html/ces_sy.html