Tampere University of Technology, Institute of Digital and Computer Systems
COFFEE™ RISC Core Design Case

Outline of Coffee Case
COFFEE™ RISC core design case
– Basic architecture
– Defining the “perfect instruction set”
– Design decisions
– Results
– Conclusions

COFFEE™ RISC Processor
Background
– An embedded processor for SoC and processor architecture research and education was needed
– No access to commercial cores (especially the internal implementation)
– We had to design our own core!
Architecture and implementation “prerequisites”
– Clean RISC style was preferred
– Control dominance but some DSP operations were foreseen
– Balanced pipeline with current technologies
– Technology independence
Consequences
– Full multiplier and barrel shifter in the pipeline
– 6 stages to keep the balance (3 + 1 execution stages for the multiply)
– Synthesizable VHDL implementation for technology independence

Following RISC Philosophy
Issue rate of one instruction per cycle (not extending the cycle, but pipelining!)
Only load and store instructions access memory: breaking the operation into two to three instruction cycles, not extending the cycle
Fixed instruction length: fast and compact decoding logic
Simplified addressing: no hardware penalty from address arithmetic
Few simple operations: short clock cycle, but a larger amount of code memory
Delayed loads and branches: slots can be filled with meaningful instructions by careful scheduling in order to hide the latencies
Let the compiler do it (scheduling is important in RISC processors)
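
The benefit of issuing one instruction per cycle can be illustrated with a little arithmetic. The sketch below is an idealized model only: it assumes no stalls and no branch penalties, the function names are hypothetical, and the stage count of 6 is taken from the pipeline depth mentioned earlier in the deck.

```python
def pipelined_cycles(n_instructions: int, stages: int) -> int:
    """Cycles to finish n instructions when one instruction issues per cycle."""
    if n_instructions == 0:
        return 0
    # The first instruction needs `stages` cycles to drain through the pipeline;
    # every following instruction completes one cycle after its predecessor.
    return stages + (n_instructions - 1)


def unpipelined_cycles(n_instructions: int, stages: int) -> int:
    """Cycles when each instruction must pass all stages before the next starts."""
    return n_instructions * stages


if __name__ == "__main__":
    n, stages = 1000, 6
    print(pipelined_cycles(n, stages))    # 1005
    print(unpipelined_cycles(n, stages))  # 6000
```

For long instruction streams the pipelined machine approaches one instruction per cycle, which is why filling delay slots and avoiding stalls matters so much.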

COFFEE™ Block Diagram
[Figure: block diagram. Blocks: PROGRAM MEMORY, DECODE LOGIC, EX1, EX2, EX3, WRITEBACK, DATA MEMORY, FORWARDING LOGIC, PC CONTROL]

Designing the “Perfect Instruction Set”
Trade-off between the amount of work done per instruction and the time taken to execute an instruction
– Estimating the execution time of an operation requires knowledge about the hardware implementation, which is not available at design time
– Comparing to existing implementations gives ideas but leads to reproducing the same ‘errors’
– A more formal approach is needed: analysing the complexity of operations implemented using different algorithms gives some measures, but is not very practical
– The Coffee Way: Keeping the instruction granularity at full-size arithmetic operations seemed a safe choice

Designing the “Perfect Instruction Set” (cont’d)
Trade-off between compiler complexity and hardware complexity
– Typically, pipelined high-performance RISC processors require sophisticated compilers in order to fully exploit hardware performance
– Constructing high-level operations from basic RISC instructions efficiently requires knowledge of both the underlying hardware and different algorithms
– To reduce compiler complexity and improve performance, hardware support for resolving data dependencies may be provided. This comes in the form of forwarding and stall logic.
– The Coffee Way: Extensive forwarding is used to simplify the construction of the compiler
  • Very little software optimization needed to avoid data dependencies
  • Highest possible throughput
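
To see why forwarding reduces the scheduling burden on the compiler, consider a toy in-order, single-issue model. This is a sketch under stated assumptions, not a model of the actual Coffee pipeline: `count_stalls` and `result_delay` are hypothetical names, and the delay of 2 cycles used for the no-forwarding case is an illustrative value, not a figure from the slides.

```python
from typing import Dict, List, Optional, Tuple

# (destination register or None, source registers)
Instr = Tuple[Optional[str], Tuple[str, ...]]


def count_stalls(program: List[Instr], result_delay: int) -> int:
    """Count stall cycles caused by read-after-write dependencies.

    result_delay is the number of extra cycles a dependent instruction must
    wait after its producer issues. 0 models ideal forwarding of single-cycle
    results; a positive value models waiting for write-back.
    """
    ready: Dict[str, int] = {}  # register -> earliest cycle a consumer may issue
    cycle = 0
    stalls = 0
    for dest, srcs in program:
        earliest = max([ready.get(r, 0) for r in srcs], default=0)
        if earliest > cycle:
            stalls += earliest - cycle  # pipeline bubbles inserted by stall logic
            cycle = earliest
        if dest is not None:
            ready[dest] = cycle + 1 + result_delay
        cycle += 1
    return stalls


if __name__ == "__main__":
    # A chain of dependent operations: each one consumes the previous result.
    chain = [
        ("r1", ("r2", "r3")),
        ("r4", ("r1", "r5")),
        ("r6", ("r4", "r1")),
    ]
    print(count_stalls(chain, result_delay=0))  # 0 stalls with ideal forwarding
    print(count_stalls(chain, result_delay=2))  # 4 stalls without forwarding
```

Without forwarding, the compiler would have to find independent instructions to fill those bubbles; with forwarding, back-to-back dependent single-cycle operations run at full rate in this model.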

Designing the “Perfect Instruction Set” (cont’d)
Trade-off between the amount of memory consumed by program code and the complexity of instructions
– Typically CISC instructions are more compact and ‘do more’ than RISC instructions
– A ‘cure’ for this is an alternative 16-bit encoding, found e.g. in ARM
– The Coffee Way: 16/32-bit encodings are provided
Trade-off between the number of instructions and the complexity of decoding logic
– Compilers use quite a limited set of machine instructions for mapping structures of a high-level language
– Why should we provide more? (total of 69 mnemonics in Coffee)

Designing the “Perfect Instruction Set” (cont’d)
Trade-off between instruction word length and decoding complexity
– A ‘wide enough’ instruction word leads to more uniform encoding, which in turn makes decoding easier because the encoding pattern does not vary so much
– On the other hand, the number of don’t-care bits increases for instructions requiring just a few bit fields if the instruction word is made longer
Trade-off between ‘capability’ of instructions and the length of the instruction word
– Issues which are directly affected by instruction word length:
  • number of operands and targets per instruction
  • the length of immediate operands
  • the total number of instructions in application code
  • ability to include special bit fields which ‘adjust’ the behavior of an instruction
  • required memory bandwidth
The Coffee Way: Limit to 32 bits (and 16 bits in the other mode)

Designing the “Perfect Instruction Set” (cont’d)
Encoding of instructions should be uniform
– Less variation in encoding leads to a simpler (and faster) decoding process
– In practice, instructions fall into a few categories having totally different encodings. Allocating a bit field for identifying the encoding class helps to implement fast decoding.
– The Coffee Way: 4 categories (+ some exceptional codings)
  • 3-operand, immediate, PC-relative branch, misc
Addressing modes supported
– Most data addressing can be accomplished by register indirect or displacement (base + offset) addressing
– Indexed (base + index) addressing requires three source registers in a store operation
– The same problem occurs with scaled (base + shifted index) addressing
– The Coffee Way: displacement addressing supported by HW
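
Displacement addressing needs only one register for the address: the effective address is the base register plus a sign-extended immediate taken from the instruction word. An indexed store, by contrast, would have to read the base, the index and the data to be stored, which is where the third source register comes from. A minimal sketch follows; the 15-bit offset width and the function names are assumptions for illustration, since the slides do not give the actual field width.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    mask = (1 << bits) - 1
    value &= mask
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def displacement_address(base: int, offset_field: int, offset_bits: int = 15) -> int:
    """Effective address = base + sign-extended offset, wrapped to 32 bits.

    offset_bits = 15 is a hypothetical field width, not taken from the source.
    """
    return (base + sign_extend(offset_field, offset_bits)) & 0xFFFFFFFF


if __name__ == "__main__":
    print(hex(displacement_address(0x1000, 4)))       # 0x1004
    print(hex(displacement_address(0x1000, 0x7FFF)))  # 0xfff (offset field encodes -1)
```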

How Many Registers?
In principle, the more registers, the fewer memory accesses
Compilers find it easier to allocate variables to a lot of registers
It is faster and more power efficient to compute on data in registers
On the other hand, registers consume area and power (but less than memory accesses!)
A large register set causes a lot of memory traffic at task switch (which also increases task switch overhead)
The Coffee Way: Two 32x32 register files
– To allow secure and fast super-user and user mode separation
– Enough registers for efficient register allocation in the compiler

COFFEE™ Design Decisions on Arithmetic
Three versions of multiplication
– 16x16 in two cycles
– 32x32 lower 32 bits in three cycles
– 32x32 upper 32 bits with an additional SW instruction cycle (to just store it after the multiplication execution) “re-using” EX3-stage
– Implemented as four 16x16 multipliers + adders + combination logic
– Internal architecture of the sub-multipliers hard-coded in VHDL (synthesis tool independence!)
No Divider
– Division is a serial operation by nature
– Very few divisions used in typical algorithms
– Does not fit well into RISC philosophy
– Coprocessor can be used for division if SW is too slow
Carry Look-Ahead Adder and Barrel Shifter
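
The construction of a 32x32 multiplier from four 16x16 multipliers plus adders can be sketched arithmetically. The code below shows the standard partial-product decomposition for unsigned operands; it is not the VHDL implementation, the function names are hypothetical, and signed handling is left out.

```python
MASK16 = 0xFFFF
MASK32 = 0xFFFFFFFF


def mul32_from_16(a: int, b: int) -> int:
    """Unsigned 32x32 -> 64-bit product built from four 16x16 partial products."""
    a_lo, a_hi = a & MASK16, (a >> 16) & MASK16
    b_lo, b_hi = b & MASK16, (b >> 16) & MASK16
    ll = a_lo * b_lo  # contributes to bits 0..31
    lh = a_lo * b_hi  # contributes to bits 16..47
    hl = a_hi * b_lo  # contributes to bits 16..47
    hh = a_hi * b_hi  # contributes to bits 32..63
    # The adders and combination logic align and sum the partial products.
    return ll + ((lh + hl) << 16) + (hh << 32)


def lower32(product: int) -> int:
    """Lower 32 bits of the 64-bit product."""
    return product & MASK32


def upper32(product: int) -> int:
    """Upper 32 bits of the 64-bit product."""
    return (product >> 32) & MASK32


if __name__ == "__main__":
    a, b = 0xDEADBEEF, 0x12345678
    p = mul32_from_16(a, b)
    print(p == a * b)                   # True
    print(hex(lower32(p)), hex(upper32(p)))
```

A 16x16 multiplication uses only the `ll` product, which is consistent with it finishing earlier than the full 32x32 case.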

Other COFFEE™ Design Decisions
Conditional execution of instructions
– To avoid performance penalty from the delay slots in control-dominated code sections
Flag generation and condition evaluation separated
– To break the critical path from arithmetic to conditional execution/branch
– To enable multiple conditions to be used (8 flag registers)
Hardware checks for illegal data and program addresses
– You can never trust the software guys
– Ok, it fits nicely in the pipeline on EX2 stage before data access on EX3
– And can be done during the program memory access at fetch (no writes!)

Yet More COFFEE™ Design Decisions
Harvard architecture at the core
– Enables separate Instruction and Data caches
– Or just Instruction cache (easy, no writes!)
– Or direct on-chip memories
OCP-IP and BVCI compliant system bus interface
– Can be wrapped e.g. to AMBA AHB to use ARM peripherals
Coprocessor interface
– Application acceleration can be done this way
– Four co-processors supported at a time
– Compatible with MIPS coprocessors
Interrupts can be handled by
– A simple interrupt control in the core (synchronization included)
– An external interrupt controller
– In most cases without flushing the pipeline! (interleaves the interrupt routine with the regular program when possible)
Boot address set externally
– Full flexibility regarding the program memory allocation

Making it Configurable and Parameterized
Separate configuration block provided
– Memory mapped register block
– Keeps the regular register files general-purpose but gives e.g. the operating system a straightforward way to set the scene for each context separately
– To set e.g.
  • Interrupt addresses and priorities
  • Protected/unprotected areas in data and memory address spaces
  • The configuration register block address itself (!)
Parameterization before synthesis is under research at the moment
– Different versions ranging from low power, low performance to high power, high performance should be provided
– For example, selecting the right type of multiplier for an application is a performance versus area (and power consumption) trade-off

COFFEE™ Design Status
Working on
– Testing the final version of the synthesizable VHDL on Xilinx Virtex-II
– High-level simulation models (VHDL and C++)
– ISS (our own C++ cycle-accurate simulator)
– HLL compiler (porting GNU compiler)
– Tool chain polishing (our own assembler/disassembler etc.)
– RTOS (porting a real-time flavour of Linux)
– Caches, peripherals, system bus interfaces
– Floating-point coprocessor (MILK)
– Integration with Proteo network-on-chip / multiprocessor support / complete platform
Future Work
– Demonstration system
– Benchmarking
– Low-power features
– Testability issues
– Design space explorer (for ISA change consequence evaluation)
– Branch prediction or speculative execution???
– Documentation and website update
– Contributions from all over the world through the publicity

COFFEE™ Synthesis Results
BLOCK                   Area (mm²)   Percentage
CORE                    1.42         100%
Control                 0.23         16%
RF+PCB                  0.85         60%
ALU                     0.34         24%
MULT_UNIT (inside ALU)  0.23         16%
Technology 0.18 μm, operating speed 200 MHz
On Xilinx Virtex-II, operating speed 40 MHz (no optimization)
On Altera Stratix, operating speed 57 MHz (no optimization)

Comparison of COFFEE™ and some other RISC cores
Feature                   ARM7TDMI                          ARM940T                       MIPS32_4KP     MIPS32_4MP     COFFEE
Architecture              von Neumann                       Harvard                       Harvard        Harvard        Harvard
Instructions              32/16                             32/16                         32 *           32 *           32/16
Area in 0.18 μm           0.53 mm² hard / 0.62 mm² synth    4.2 mm² incl. 4k + 4k caches  0.8 – 2.5 mm²  0.8 – 2.5 mm²  1.42 mm²
Clock rate                88 MHz hard / 80 – 110 MHz synth  185 MHz                       160 – 240 MHz  160 – 240 MHz  200 MHz
Registers visible         15 + PC                           15 + PC                       32 + 2         32 + 2         32
Registers total           31                                31                            32 + 2 *       32 + 2 *       64
Multiply cycles 32x32=64  3-6                               3-6                           32             3              2
Multiply cycles 32x32=32  2-5                               2-5                           32             3              1
Pipeline stages           3                                 5                             5              5              6
Conditional               yes                               yes                           no             no             yes

Coffee RISC core and Milk coprocessor
Coffee can be connected to several peripherals
FPU
up to 4 coprocessors
[Figure: interface between COFFEE CORE and MILK COPROCESSOR. Signals: wr_cop, rd_cop, c_index (1..0), r_index (3..0), cop_exc, data (31..0)]
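
The signal widths in the figure fix the addressing range: a 2-bit `c_index` selects one of up to 4 coprocessors and a 4-bit `r_index` selects one of up to 16 registers within it, over a 32-bit data bus. The toy model below only illustrates that addressing. The class name and the read/write semantics are assumptions, timing is ignored, and `cop_exc` is not modelled.

```python
class CoprocessorPort:
    """Toy model of the core-to-coprocessor register interface (assumed semantics)."""

    NUM_COPROCESSORS = 4  # c_index is 2 bits wide: (1..0)
    REGS_PER_COP = 16     # r_index is 4 bits wide: (3..0)

    def __init__(self) -> None:
        self.regs = [[0] * self.REGS_PER_COP for _ in range(self.NUM_COPROCESSORS)]

    def _check(self, c_index: int, r_index: int) -> None:
        if not 0 <= c_index < self.NUM_COPROCESSORS:
            raise ValueError("c_index must fit in 2 bits")
        if not 0 <= r_index < self.REGS_PER_COP:
            raise ValueError("r_index must fit in 4 bits")

    def wr_cop(self, c_index: int, r_index: int, data: int) -> None:
        """Write a 32-bit word to a coprocessor register."""
        self._check(c_index, r_index)
        self.regs[c_index][r_index] = data & 0xFFFFFFFF  # data bus is 32 bits wide

    def rd_cop(self, c_index: int, r_index: int) -> int:
        """Read a 32-bit word from a coprocessor register."""
        self._check(c_index, r_index)
        return self.regs[c_index][r_index]


if __name__ == "__main__":
    port = CoprocessorPort()
    port.wr_cop(0, 3, 0x3F800000)    # bit pattern of 1.0 in single precision
    print(hex(port.rd_cop(0, 3)))    # 0x3f800000
```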

Floating-point number representation
Several representations allowed
– exponent / fractional part representation
  • single precision, IEEE 754-1985 standard
$$ \text{number} = (-1)^{s} \times (1.f) \times 2^{(e - \text{bias})} $$
Field widths (bits): s = 1, e = 8, f = 23
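
The field layout can be checked with a few lines of code. This sketch splits a 32-bit word into the sign, exponent and fraction fields and evaluates the formula for normalized numbers only. The bias of 127 is the IEEE 754 single-precision value; the function names are hypothetical.

```python
import struct

BIAS = 127  # exponent bias for IEEE 754 single precision


def decode_single(bits: int):
    """Split a 32-bit word into (s, e, f): 1 sign, 8 exponent, 23 fraction bits."""
    s = (bits >> 31) & 0x1
    e = (bits >> 23) & 0xFF
    f = bits & 0x7FFFFF
    return s, e, f


def normalized_value(bits: int) -> float:
    """Evaluate (-1)^s * 1.f * 2^(e - bias) for a normalized number."""
    s, e, f = decode_single(bits)
    if e == 0 or e == 0xFF:
        raise ValueError("not a normalized number (zero, subnormal, infinity or NaN)")
    return (-1.0) ** s * (1.0 + f / 2**23) * 2.0 ** (e - BIAS)


if __name__ == "__main__":
    # Obtain the bit pattern of -6.5 from the platform's own encoder.
    bits = struct.unpack(">I", struct.pack(">f", -6.5))[0]
    print(hex(bits))               # 0xc0d00000
    print(decode_single(bits))     # (1, 129, 5242880)
    print(normalized_value(bits))  # -6.5
```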

Milk coprocessor: general overview
32-bit internal parallelism
8 general purpose registers + 2 special purpose registers (status register and control register)
Parallel execution units

Milk coprocessor: general overview (cont’d)
[Figure: Milk block diagram. Blocks: I/O, REGFILE, ADDSUB, MUL, DIV, SQRT, ABS, NEG, CONV, TRUNC]

Functional units: multiplier
[Figure: multiplier block diagram. Blocks: LATCHING, LOGIC, EXP SUM & m PROD, POSTNORMALIZATION STAGE, SIGN GEN, OPERANDS EXPANSION, SPECIAL OPERANDS HANDLING, RESULT ROUNDING & PACKING]

Results
Instance        Area [μm²]    Area [NAND gates]   Latency [ns]   CLK cycles
multiplicator   168,009       13 K                18.36          2
addsub          269,664       21 K                27.86          3
divider         889,045       70 K                94.72          10
sqrt            188,362       15 K                74.65          8
conv            27,164        2 K                 8.85           1
trunc           29,777        2 K                 9.12           1
reg_file        67,661        5 K                 2.21           1
miscellaneous   12,493        1 K
TOT             1,652,175     129 K
Synthesis on STMicroelectronics 0.18 μm standard cell technology @ 100 MHz

Results
Comparison with a similar device: CalmRISC FPU from Samsung, synthesized on 0.25 μm standard cell technology @ 70 MHz
Instruction (clock cycles)   CalmRISC   Milk
multiplication               2          2
add/sub                      3          3
division                     17         10
sqrt                         n.i.       8
conv                         3          1
trunc                        3          1
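
Cycle counts alone hide the difference in clock rate. As a rough illustration, the sketch below converts the cycle counts to time using the synthesis clocks quoted in the slides (70 MHz for CalmRISC, 100 MHz for Milk). This is simple arithmetic under the assumption that each operation occupies exactly the listed number of full clock periods. The two designs were synthesized on different process technologies, so the numbers are not a like-for-like comparison.

```python
def latency_ns(cycles: int, clock_mhz: float) -> float:
    """Time taken by `cycles` full clock periods at the given clock rate."""
    return cycles * 1000.0 / clock_mhz


# (CalmRISC cycles, Milk cycles), as listed in the comparison table
CYCLES = {
    "multiplication": (2, 2),
    "add/sub": (3, 3),
    "division": (17, 10),
    "conv": (3, 1),
    "trunc": (3, 1),
}

CALMRISC_MHZ = 70.0
MILK_MHZ = 100.0

if __name__ == "__main__":
    for name, (calm, milk) in CYCLES.items():
        print(f"{name:15s} CalmRISC {latency_ns(calm, CALMRISC_MHZ):7.1f} ns"
              f"   Milk {latency_ns(milk, MILK_MHZ):7.1f} ns")
```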

Conclusions
Coffee is a reasonably powerful RISC core combining the best features of commercial cores
Price/performance could be optimized by parameterization
All HW and SW are starting to fall into place (including an FP coprocessor)
In public distribution – a Linux phenomenon to be seen?
Coffee is a good platform for further development and research