Tampere University of Technology, Institute of Digital and Computer Systems

COFFEE TM RISC Core Design Case



Outline of Coffee Case

COFFEE TM RISC core design case

– Basic architecture

– Defining the “perfect instruction set”

– Design decisions

– Results

– Conclusions


COFFEE TM RISC Processor

Background
– An embedded processor for SoC and processor architecture research and education was needed
– No access to commercial cores (especially the internal implementation)
– We had to design our own core!

Architecture and implementation “prerequisites”
– Clean RISC style was preferred
– Control dominance, but some DSP operations were foreseen
– Balanced pipeline with current technologies
– Technology independence

Consequences
– Full multiplier and barrel shifter in the pipeline
– 6 stages to keep the balance (3 + 1 execution stages for the multiply)
– Synthesizable VHDL implementation for technology independence


Following RISC Philosophy

Issue rate of one instruction per cycle (not extending the cycle, but pipelining!)

Only load and store instructions access memory: the operation is broken into two or three instruction cycles rather than extending the cycle

Fixed instruction length: fast and compact decoding logic

Simplified addressing: no hardware penalty from address arithmetic

Few simple operations: short clock cycle, but a larger amount of code memory

Delayed loads and branches: the slots can be filled with meaningful instructions by careful scheduling in order to hide the latencies

Let the compiler do it (scheduling is important in RISC processors)
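The benefit of pipelining over extending the cycle can be illustrated with a simple cycle count. The following Python sketch assumes an ideal pipeline with no stalls or unfilled delay slots; the function names are illustrative, and only the six-stage depth is taken from the slides.

```python
def pipelined_cycles(n_instructions: int, stages: int) -> int:
    """Cycles to complete n instructions on an ideal, stall-free pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction needs `stages` cycles to drain through the pipe;
    # every following instruction completes one cycle after its predecessor.
    return stages + n_instructions - 1


def unpipelined_cycles(n_instructions: int, stages: int) -> int:
    """Cycles if each instruction runs through all stages before the next starts."""
    return n_instructions * stages


if __name__ == "__main__":
    n, stages = 1000, 6  # 6 stages as in the COFFEE pipeline
    print(pipelined_cycles(n, stages), unpipelined_cycles(n, stages))
```

For long instruction streams the pipelined count approaches one cycle per instruction, which is the issue rate named above. Real code falls short of this whenever a delay slot cannot be filled.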


COFFEE TM Block Diagram

[Figure: COFFEE block diagram. Blocks shown: program memory, decode logic, EX1, EX2, EX3, write back, data memory, forwarding logic, PC control.]


Designing the “Perfect Instruction Set”

Trade-off between the amount of work done per instruction and the time taken to execute an instruction
– Estimating the execution time of an operation requires knowledge about the hardware implementation, which is not available at design time
– Comparing to existing implementations gives ideas but leads to reproducing the same ‘errors’
– A more formal approach is needed: analysing the complexity of operations implemented with different algorithms gives some measures, but is not very practical
– The Coffee Way: Keeping the instruction granularity at full-size arithmetic operations seemed a safe choice


Designing the “Perfect Instruction Set” (cont’d)

Trade-off between compiler complexity and hardware complexity
– Typically, pipelined high-performance RISC processors require sophisticated compilers in order to fully exploit hardware performance
– Constructing high-level operations efficiently from basic RISC instructions requires knowledge of both the underlying hardware and different algorithms
– To reduce compiler complexity and improve performance, hardware support for resolving data dependencies may be provided. This comes in the form of forwarding and stall logic.
– The Coffee Way: Extensive forwarding is used to simplify the construction of the compiler
• Very little software optimization needed to avoid data dependencies
• Highest possible throughput
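The effect of forwarding on dependent instructions can be sketched with a toy in-order issue model. This is an illustration only, not COFFEE's actual forwarding logic: the register names, the example program and the three-cycle "no forwarding" delay are all assumptions.

```python
def count_stalls(program, result_delay):
    """Count stall cycles for an in-order, single-issue pipeline.

    program: list of (dest, sources) tuples of register names.
    result_delay: cycles after issue before a result can feed a later
    instruction (1 = usable by the very next instruction, i.e. full
    forwarding; larger values model waiting for write-back).
    """
    ready_at = {}  # register -> earliest cycle a consumer may issue
    cycle = 0
    stalls = 0
    for dest, sources in program:
        earliest = max([ready_at.get(r, 0) for r in sources], default=0)
        if earliest > cycle:
            stalls += earliest - cycle  # wait until all operands are ready
            cycle = earliest
        ready_at[dest] = cycle + result_delay
        cycle += 1
    return stalls


# A chain of dependent additions: r1 = r2 + r3; r4 = r1 + r5; r6 = r4 + r1
chain = [("r1", ["r2", "r3"]), ("r4", ["r1", "r5"]), ("r6", ["r4", "r1"])]

if __name__ == "__main__":
    print("with forwarding:", count_stalls(chain, result_delay=1))
    print("without (hypothetical 3-cycle delay):", count_stalls(chain, result_delay=3))
```

With a one-cycle result delay the chain issues back to back. With the hypothetical three-cycle delay, either the hardware stalls or the compiler has to find independent instructions to schedule in between.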



Trade-off between the amount of memory consumed by program code and the complexity of instructions
– Typically CISC instructions are more compact and ‘do more’ than RISC instructions
– A ‘cure’ for this is an alternative 16-bit encoding, found e.g. in ARM
– The Coffee Way: 16/32-bit encodings are provided

Trade-off between the number of instructions and the complexity of decoding logic
– Compilers use quite a limited set of machine instructions for mapping structures of a high-level language
– Why should we provide more? (total of 69 mnemonics in Coffee)



Trade-off between instruction word length and decoding complexity
– A ‘wide enough’ instruction word leads to a more uniform encoding, which in turn makes decoding easier because the encoding pattern does not vary so much
– On the other hand, if the instruction word is made longer, the number of don’t care bits increases for instructions requiring just a few bit fields

Trade-off between the ‘capability’ of instructions and the length of the instruction word
– Issues which are directly affected by instruction word length:
• number of operands and targets per instruction
• the length of immediate operands
• the total number of instructions in application code
• ability to include special bit fields which ‘adjust’ the behavior of an instruction
• required memory bandwidth

The Coffee Way: Limit to 32 bits (and 16 bits in the other mode)
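The pressure that word length puts on operand fields can be made concrete with a small bit-budget calculation. The sketch below only counts register-index bits; how COFFEE actually divides the remaining bits between opcode, condition and immediate fields, and how its 16-bit mode fits operands, is not stated here and is not modelled.

```python
def register_field_bits(num_registers: int) -> int:
    """Bits needed to index one register out of num_registers."""
    return (num_registers - 1).bit_length()


def bits_left(word_bits: int, num_registers: int, register_fields: int) -> int:
    """Bits remaining for opcode, immediates etc. after the register fields."""
    return word_bits - register_fields * register_field_bits(num_registers)


if __name__ == "__main__":
    # 32 addressable registers, three register operands
    print(bits_left(32, 32, 3))  # 32-bit instruction word
    print(bits_left(16, 32, 3))  # 16-bit instruction word
```

With 32 addressable registers, a three-operand instruction leaves 17 bits in a 32-bit word but only a single bit in a 16-bit word. A short encoding therefore has to give up operands, registers or both.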



Encoding of instructions should be uniform
– Less variation in encoding leads to a simpler (and faster) decoding process
– In practice, instructions fall into a few categories having totally different encoding. Allocating a bit field for identifying the encoding class helps to implement fast decoding.
– The Coffee Way: 4 categories (+ some exceptional codings)
• 3-operand, immediate, PC-relative branch, misc

Addressing modes supported
– Most data addressing can be accomplished by register indirect or displacement (base + offset) addressing
– Indexed (base + index) addressing requires three source registers in a store operation
– The same problem occurs with scaled (base + shifted index) addressing
– The Coffee Way: displacement addressing supported by HW
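A class-identifying bit field and displacement addressing can be sketched as follows. The bit positions are hypothetical: the placement of the class field in the top two bits and the 15-bit offset width are assumptions chosen for illustration and are not COFFEE's real encoding.

```python
# The four encoding categories named above.
CLASSES = ("3-operand", "immediate", "PC-relative branch", "misc")


def encoding_class(word: int) -> str:
    """HYPOTHETICAL layout: class id stored in the top two bits of a 32-bit word."""
    return CLASSES[(word >> 30) & 0b11]


def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)


def displacement_address(base: int, offset_field: int, offset_bits: int = 15) -> int:
    """Effective address = base register + sign-extended offset (mod 2**32).

    offset_bits=15 is an assumed field width, not taken from the slides.
    """
    return (base + sign_extend(offset_field, offset_bits)) & 0xFFFFFFFF


if __name__ == "__main__":
    print(encoding_class(0x40000000))
    print(hex(displacement_address(0x1000, 4)))
    print(hex(displacement_address(0x1000, 0x7FFF)))  # offset field encodes -1
```

A displacement store needs only two source registers (base and data), whereas an indexed store needs three (base, index and data). That extra register read is the cost the slide refers to.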


How Many Registers?

In principle, the more registers, the fewer memory accesses

Compilers find it easier to allocate variables when there are a lot of registers

It is faster and more power efficient to compute on data in registers

On the other hand, registers consume area and power (but less than memory accesses!)

A large register set causes a lot of memory traffic at a task switch (and so increases task switch overhead)

The Coffee Way: Two 32x32 register files
– To allow secure and fast super-user and user mode separation
– Enough registers for efficient register allocation in the compiler


COFFEE TM Design Decisions on Arithmetic

Three versions of multiplication
– 16x16 in two cycles
– 32x32 lower 32 bits in three cycles
– 32x32 upper 32 bits with an additional SW instruction cycle (to just store it after the multiplication execution), “re-using” the EX3 stage
– Implemented as four 16x16 multipliers + adders + combination logic
– Internal architecture of the sub-multipliers hard-coded in VHDL (synthesis tool independence!)

No Divider
– Division is a serial operation by nature
– Very few divisions used in typical algorithms
– Does not suit well into RISC philosophy
– Coprocessor can be used for division if SW is too slow

Carry Look-Ahead Adder and Barrel Shifter
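Building a 32x32 multiplication from four 16x16 multipliers follows the usual partial-product decomposition. The sketch below shows the arithmetic for unsigned operands only. It says nothing about how COFFEE handles signed operands or how the work is split across pipeline stages, and the function names are illustrative.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul16(a: int, b: int) -> int:
    """Stand-in for one 16x16 hardware multiplier (unsigned, 32-bit result)."""
    assert 0 <= a <= MASK16 and 0 <= b <= MASK16
    return a * b


def mul32_from_16(a: int, b: int):
    """Unsigned 32x32 multiply from four 16x16 products.

    Returns (upper 32 bits, lower 32 bits) of the 64-bit product.
    """
    a_hi, a_lo = a >> 16, a & MASK16
    b_hi, b_lo = b >> 16, b & MASK16
    ll = mul16(a_lo, b_lo)  # weight 2**0
    lh = mul16(a_lo, b_hi)  # weight 2**16
    hl = mul16(a_hi, b_lo)  # weight 2**16
    hh = mul16(a_hi, b_hi)  # weight 2**32
    # The adders and combination logic sum the shifted partial products.
    product = ll + ((lh + hl) << 16) + (hh << 32)
    return (product >> 32) & MASK32, product & MASK32


if __name__ == "__main__":
    upper, lower = mul32_from_16(0xDEADBEEF, 0x12345678)
    print(hex(upper), hex(lower))
```

The lower word depends only on `ll`, `lh` and `hl`, while the upper word also needs `hh` and the carries from below.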


Other COFFEE TM Design Decisions

Conditional execution of instructions
– To avoid performance penalty from the delay slots in control-dominated code sections

Flag generation and condition evaluation separated
– To break the critical path from arithmetic to conditional execution/branch
– To enable multiple conditions to be used (8 flag registers)

Hardware checks for illegal data and program addresses
– You can never trust the software guys
– Ok, it fits nicely in the pipeline on EX2 stage before data access on EX3
– And can be done during the program memory access at fetch (no writes!)


Yet More COFFEE TM Design Decisions

Harvard architecture at the core
– Enables separate Instruction and Data caches
– Or just Instruction cache (easy, no writes!)
– Or direct on-chip memories

OCP-IP and BVCI compliant system bus interface
– Can be wrapped e.g. to AMBA AHB to use ARM peripherals

Coprocessor interface
– Application acceleration can be done this way
– Four co-processors supported at a time
– Compatible with MIPS coprocessors

Interrupts can be handled by
– A simple interrupt control in the core (synchronization included)
– An external interrupt controller
– In most cases without flushing the pipeline! (interleaves the interrupt routine with the regular program when possible)

Boot address set externally
– Full flexibility regarding the program memory allocation


Making it Configurable and Parameterized

Separate configuration block provided
– Memory mapped register block
– Keeps the regular register files general-purpose but gives e.g. an operating system a straightforward way to set the scene for each context separately
– To set e.g.
• Interrupt addresses and priorities
• Protected/unprotected areas in data and memory address spaces
• The configuration register block address itself (!)

Parameterization before synthesis is under research at the moment
– Different versions ranging from low power, low performance to high power, high performance should be provided
– For example, selecting the right type of multiplier for the application is a performance versus area (and power consumption) trade-off


COFFEE TM Design Status

Working on
– Testing the final version of the synthesizable VHDL on Xilinx Virtex-II
– High-level simulation models (VHDL and C++)
– ISS (our own C++ cycle-accurate simulator)
– HLL compiler (porting GNU compiler)
– Tool chain polishing (our own assembler/disassembler etc.)
– RTOS (porting a real-time flavour of Linux)
– Caches, peripherals, system bus interfaces
– Floating-point coprocessor (MILK)
– Integration with Proteo network-on-chip / multiprocessor support / complete platform

Future Work
– Demonstration system
– Benchmarking
– Low-power features
– Testability issues
– Design space explorer (for ISA change consequence evaluation)
– Branch prediction or speculative execution???
– Documentation and website update
– Contributions from all over the world through the publicity


COFFEE TM Synthesis Results

BLOCK                     Area (mm2)   Percentage
CORE                      1.42         100%
Control                   0.23         16%
RF+PCB                    0.85         60%
ALU                       0.34         24%
MULT_UNIT (inside ALU)    0.23         16%

Technology 0.18 μm, operating speed 200 MHz

On Xilinx Virtex-II, operating speed 40 MHz (no optimization)

On Altera Stratix, operating speed 57 MHz (no optimization)
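The percentage column of the area breakdown can be re-derived from the area figures, as in the Python sketch below. The variable names are illustrative; the numbers are the ones reported above, with MULT_UNIT counted inside the ALU rather than as a separate top-level block.

```python
CORE_MM2 = 1.42

# Top-level blocks; MULT_UNIT (0.23 mm2) is part of the ALU figure.
AREAS_MM2 = {"Control": 0.23, "RF+PCB": 0.85, "ALU": 0.34}
MULT_UNIT_MM2 = 0.23


def share_percent(area_mm2: float, total_mm2: float = CORE_MM2) -> int:
    """Share of the core area, rounded to a whole percent."""
    return round(100 * area_mm2 / total_mm2)


if __name__ == "__main__":
    for block, area in AREAS_MM2.items():
        print(f"{block:8s} {area:.2f} mm2  {share_percent(area)}%")
    print(f"sum of blocks: {sum(AREAS_MM2.values()):.2f} mm2")
    print(f"multiplier share of ALU: {round(100 * MULT_UNIT_MM2 / AREAS_MM2['ALU'])}%")
```

The three top-level blocks add up to the full 1.42 mm2, and the multiplier alone accounts for roughly two thirds of the ALU area.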


Comparison of COFFEE TM and some other RISC cores

Feature                    ARM7TDMI                              ARM940T                           MIPS32_4KP      MIPS32_4MP      COFFEE
Architecture               von Neumann                           Harvard                           Harvard         Harvard         Harvard
Instruction width (bits)   32/16                                 32/16                             32 *            32 *            32/16
Area in 0.18 um            0.53 mm2 (hard), 0.62 mm2 (synth)     4.2 mm2 (incl. 4k + 4k caches)    0.8 – 2.5 mm2   0.8 – 2.5 mm2   1.42 mm2
Clock rate                 88 MHz (hard), 80 – 110 MHz (synth)   185 MHz                           160 – 240 MHz   160 – 240 MHz   200 MHz
Registers visible          15 + PC                               15 + PC                           32 + 2          32 + 2          32
Registers total            31                                    31                                32 + 2 *        32 + 2 *        64
Multiply cycles, 32x32=64  3-6                                   3-6                               32              3               2
Multiply cycles, 32x32=32  2-5                                   2-5                               32              3               1
Pipeline stages            3                                     5                                 5               5               6
Conditional execution      yes                                   yes                               no              no              yes


Coffee RISC core and Milk coprocessor

Coffee can be connected to several peripherals

Up to 4 coprocessors (e.g. an FPU)

[Figure: interface between the COFFEE core and the MILK coprocessor. Signals shown: wr_cop, rd_cop, c_index (1..0), r_index (3..0), cop_exc, data (31..0).]


Floating-point numbers representation

Several representations allowed
– exponent / fractional part representation
• single precision IEEE 754-1985 standard

$$ number = (-1)^{s} \times (1.f) \times 2^{(e - bias)} $$

Field:  s   e   f
Bits:   1   8   23
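The single-precision layout (1 sign bit, 8 exponent bits, 23 fraction bits) can be checked with a short Python sketch. It covers normalized numbers only; the bias of 127 is the standard IEEE 754 single-precision value rather than something stated on the slide, and the function names are illustrative.

```python
import struct

BIAS = 127  # IEEE 754 single-precision exponent bias


def float_to_bits(x: float) -> int:
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def decode_single(bits: int):
    """Split a 32-bit pattern into (s, e, f): 1, 8 and 23 bits."""
    s = (bits >> 31) & 0x1
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    return s, e, f


def normalized_value(bits: int) -> float:
    """Evaluate (-1)^s * 1.f * 2^(e - bias) for a normalized number."""
    s, e, f = decode_single(bits)
    if e == 0 or e == 0xFF:
        raise ValueError("zero, denormal, infinity and NaN are not covered here")
    return (-1) ** s * (1 + f / 2**23) * 2.0 ** (e - BIAS)


if __name__ == "__main__":
    bits = float_to_bits(-6.5)
    print(hex(bits), decode_single(bits), normalized_value(bits))
```

For example, -6.5 equals -1.625 × 2^2, so s = 1, e = 129 and f = 0.625 × 2^23.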


Milk coprocessor: general overview

32-bit internal parallelism

8 general purpose registers + 2 special purpose registers (status register and control register)

Parallel execution units


Milk coprocessor: general overview

[Figure: Milk block diagram. I/O and register file (REGFILE) connected to the functional units ADDSUB, MUL, DIV, SQRT, ABS, NEG, CONV and TRUNC.]


Functional units: multiplier

[Figure: multiplier block diagram. Blocks shown: latching, logic, exponent sum & mantissa product (EXP SUM & m PROD), postnormalization stage, sign generation (SIGNGEN), operands expansion, special operands handling, result rounding & packing.]


Results

Instance        Area [μm²]   Area [NAND gates]   Latency [ns]   CLK cycles
multiplicator   168,009      13 K                18.36          2
addsub          269,664      21 K                27.86          3
divider         889,045      70 K                94.72          10
sqrt            188,362      15 K                74.65          8
conv            27,164       2 K                 8.85           1
trunc           29,777       2 K                 9.12           1
reg_file        67,661       5 K                 2.21           1
miscellaneous   12,493       1 K
TOT             1,652,175    129 K

Synthesis on STMicroelectronics 0.18 μm standard cell technology @ 100 MHz
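The cycle counts in the table are consistent with rounding each unit's latency up to a whole number of 10 ns clock periods at 100 MHz. The sketch below checks this. The rounding rule is an assumption that happens to reproduce the reported numbers; it is not a statement about how the pipeline depths were actually chosen.

```python
import math

CLOCK_MHZ = 100

# Latencies in ns and reported clock cycles, as listed in the results table.
LATENCY_NS = {
    "multiplicator": 18.36,
    "addsub": 27.86,
    "divider": 94.72,
    "sqrt": 74.65,
    "conv": 8.85,
    "trunc": 9.12,
    "reg_file": 2.21,
}
REPORTED_CYCLES = {
    "multiplicator": 2, "addsub": 3, "divider": 10, "sqrt": 8,
    "conv": 1, "trunc": 1, "reg_file": 1,
}


def cycles_needed(latency_ns: float, clock_mhz: float = CLOCK_MHZ) -> int:
    """Whole clock periods needed to cover a combinational latency."""
    period_ns = 1000 / clock_mhz
    return math.ceil(latency_ns / period_ns)


if __name__ == "__main__":
    for unit, latency in LATENCY_NS.items():
        print(f"{unit:14s} {latency:6.2f} ns -> {cycles_needed(latency)} cycles")
```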


Results

Comparison with a similar device: CalmRISC FPU from Samsung, synthesized on 0.25 μm standard cell technology @ 70 MHz

Instruction      CalmRISC   Milk
multiplication   2          2
add/sub          3          3
division         17         10
sqrt             n.i.       8
conv             3          1
trunc            3          1
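Since the two devices run at different clock rates, cycle counts can be converted to execution times for a rough comparison. The sketch below reads the 70 MHz figure as the CalmRISC FPU clock and uses the 100 MHz Milk synthesis clock; both readings are assumptions. The two designs also use different technologies (0.25 μm versus 0.18 μm), so this is not a like-for-like comparison.

```python
CALMRISC_MHZ = 70   # assumed to be the CalmRISC FPU clock
MILK_MHZ = 100      # Milk synthesis clock

# instruction -> (CalmRISC cycles, Milk cycles); sqrt omitted (n.i. on CalmRISC)
CYCLES = {
    "multiplication": (2, 2),
    "add/sub": (3, 3),
    "division": (17, 10),
    "conv": (3, 1),
    "trunc": (3, 1),
}


def time_ns(cycles: int, clock_mhz: float) -> float:
    """Execution time in nanoseconds for a given cycle count and clock."""
    return cycles * 1000 / clock_mhz


if __name__ == "__main__":
    for name, (calm, milk) in CYCLES.items():
        print(f"{name:15s} CalmRISC {time_ns(calm, CALMRISC_MHZ):6.1f} ns"
              f"   Milk {time_ns(milk, MILK_MHZ):6.1f} ns")
```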


Conclusions

Coffee is a reasonably powerful RISC core combining the best features of commercial cores

Price/performance could be optimized by parameterization

All HW and SW are starting to be in place (including an FP coprocessor)

In public distribution – a Linux phenomenon to be seen?

Coffee is a good platform for further development and research

Documents
