

Communicating Process Architectures 2018
K. Chalmers, J.B. Pedersen et al. (Eds.)
IOS Press, 2018
© 2018 The authors and IOS Press. All rights reserved.

T42 – Transputer in FPGA

Uwe MIELKE a and Martin ZABEL b, in collaboration with Michael BRUESTLE c

a Electronics Engineer, Dresden, Germany, [email protected]
b Institute of Computer Engineering, Technische Universitaet Dresden, Germany, [email protected] ; [email protected]
c Electronics Engineer, Vienna, Austria, [email protected]

Abstract. The T42 Transputer in FPGA is a fully binary-code compatible open-source VHDL implementation of the Inmos IMS-T425 32-bit microprocessor. The T42 is mainly targeted at education and exploration, and many interlinked cores can be loaded onto any suitably sized FPGA board. Larger parallel computing systems based on the Transputer's distributed memory architecture can be constructed (e.g. within student semester projects). The programming model of occam can be evaluated and compared to other languages.

Keywords. Transputer, FPGA processor, soft-core, soft-CPU, many-core, parallel architecture, distributed memory, communicating sequential processes, occam

Introduction

The intention of our T42 Transputer in FPGA open-source VHDL project is to provide an alternative parallel computing platform based on the distributed memory architecture. It may be used to complement the teaching of computer engineering and computer architecture beyond MIPS and the other RISC architectures that are omnipresent today. The Transputer, with its inherent message passing capabilities, seems to us an ideal toy for students, as well as interested engineers, to explore the programming model of occam (based on C.A.R. Hoare's CSP [12]) and to play with its hardware.

The Transputer was introduced in 1984. At that time computing performance and clock frequencies were limited. Inmos provided an elegant solution for tackling larger problems with their Transputer, which was meant as a building block for large parallel computing systems. Its programming language occam allowed code to be developed on a single Transputer and afterwards distributed onto theoretically any number of Transputers for cooperative execution in parallel. This enabled new software engineering possibilities and inspired solutions to complex scientific problems that could not be solved before. During this "golden age" the Transputer received considerable funding from the European Strategic Program on Research in Information Technology (ESPRIT). A significant number of programming languages, scientific libraries and operating systems were developed for the Transputer [51-54].

Our T42 Transputer in FPGA is fully binary-code compatible with the Inmos IMS-T425. One or many T42 cores can be loaded onto any suitably sized FPGA board. The 4 bidirectional links of each core can be used for plug-and-play interconnection to easily build quite large computing networks. A further positive aspect is the still available stock of Transputer software, including the Kent retargetable occam-pi compiler [39-43].

Section 1 of this paper provides some basic Transputer information. Section 2 explains the T42 implementation and Section 3 the T42 verification. Section 4 describes synthesis results on two FPGA boards, and a discussion follows in Section 5.

CPA 2018 preprint – the proceedings version will have other page numbers and may have minor differences.


1. Some Basics: Transputer and occam

A brief description of some Transputer fundamentals is given here. The Transputer, as a building block for larger computing networks, consists of an integer CPU, fast on-chip memory, 2 timers and 4 bidirectional serial communication links with concurrent DMA capabilities.

The development of the Transputer and its programming language occam at Inmos was founded on the theory of Communicating Sequential Processes (CSP) by C.A.R. Hoare [12], first published in 1978. A computer program written in occam can be considered as a set of (concurrent) individual processes which communicate via channels (from the point of view of a process, this communication can be understood as an assignment of variables). In occam a channel is used to synchronise the sender process and the receiver process of a message. A channel may be either a word in memory or a serial communication link. This allows processes which reside on different Transputers to communicate based on the same principles as within the memory space of a single Transputer.

The Transputer instruction set architecture reflects the occam programming model, and the Transputer itself can be understood as its hardware implementation [13-15]. The Transputer is capable of autonomously executing (theoretically) any number of (concurrent) processes. This "process scheduler" is a unique feature of the Transputer and can be considered as a very simple operating system (written in microcode). The scheduler manages processes and events on two priority levels with the help of linked lists (process queues). Each ready-to-run process holds its state (variables, next instruction pointer value) in a defined memory location: the workspace. Thus only the pointer to this workspace is necessary to define the process. When a process is running, its current state is extended by the processor registers (evaluation stack, instruction pointer, workspace pointer, status register).

Events may happen at any time, caused by requests from the 8 link channels, the 2 timers (on two priority levels) or the one external event channel. The scheduler handles all requests based on their priority and order, and the corresponding process is either started immediately or put on a linked list (process queue) for later execution.

More detailed information about the functionality of the Transputer can be found in the Inmos Transputer documentation [22, 23] and in the literature [34-38].
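The process queues described above can be illustrated with a minimal Python sketch. It is not the T42 microcode: the class name `Scheduler`, the dictionary standing in for the link word of each workspace, and the workspace addresses are assumptions made for illustration. It only shows how a front pointer and a back pointer per priority level are enough to keep ready processes in a linked list, with the high priority queue served first.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

HIGH, LOW = 0, 1  # the two priority levels


@dataclass
class Scheduler:
    # link[wptr] stands in for the link word kept in each workspace (assumption)
    link: Dict[int, Optional[int]] = field(default_factory=dict)
    front: List[Optional[int]] = field(default_factory=lambda: [None, None])
    back: List[Optional[int]] = field(default_factory=lambda: [None, None])

    def enqueue(self, wptr: int, prio: int) -> None:
        """Append a ready process (identified by its workspace pointer)."""
        self.link[wptr] = None
        if self.front[prio] is None:
            self.front[prio] = wptr           # queue was empty
        else:
            self.link[self.back[prio]] = wptr  # chain behind the last process
        self.back[prio] = wptr

    def dequeue(self) -> Optional[Tuple[int, int]]:
        """Return (wptr, prio) of the next process to run, high priority first."""
        for prio in (HIGH, LOW):
            wptr = self.front[prio]
            if wptr is not None:
                self.front[prio] = self.link.pop(wptr)
                if self.front[prio] is None:
                    self.back[prio] = None
                return wptr, prio
        return None
```

A process is fully identified by its workspace pointer here, which mirrors the statement that only this pointer is needed to define a process.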

2. T42 Implementation

2.1 Retrospective

The idea of a T42 Transputer in FPGA dates back to 2012, when the first schematics were drawn, but the real impetus came from David May's publication of the Inmos "Simple-42" Transputer prototype documents in spring 2013 [6-8], together with the IMS-T414 high-level microcode [9]. Further encouragement came from Roger Shepherd, who provided an early Inmos IMS-T425 specification in 2014 [10].

The whole T42 design project was a great collaborative effort. The T42 core design was done by Uwe Mielke as a guest student at Technische Universitaet Dresden and took over 5 years of his spare time. Michael Bruestle (Vienna) provided the Transputer Verification Suite (TVS-1) for checking the compliance of the T42 hardware against a real IMS-T425. Martin Zabel, from the academic staff of the Chair of VLSI Design, Diagnostics and Architecture, contributed his VHDL design and teaching experience as well as the cache controller and the DDR-SDRAM controller interface, adapted from an available implementation used in an educational practical VHDL course on MIPS.


It was more than curiosity that led the Chair of VLSI Design, Diagnostics and Architecture at the Faculty of Computer Science, Technische Universitaet Dresden, Germany, to support the redevelopment of a Transputer core in FPGA. The Institute of Computer Engineering has rich experience with its own multi-core Java byte-code processor in FPGA [1, 2], named SHAP (Secure Hardware Agent Platform). Many papers, some diploma theses and PhD dissertations [3, 4] have been written about SHAP since 2006.

2.2 Design Targets

The T42 Transputer in FPGA is fully binary-code compatible with the original Inmos IMS-T425 (see the comparison in Figure 1). Its VHDL sources will be made available as open source under the GNU General Public License, either version 3 of the license or (at your option) any later version. The full instruction set has been implemented and any still available Transputer software should run. Our primary design targets were defined as follows:

1) achieve full binary-code compatibility with the Inmos IMS-T425,
2) enable easy loading of many T42 cores into one suitably sized FPGA,
3) provide an interface to state-of-the-art (DDR-SDRAM) memory.

Verification and documentation of each design step are mandatory objectives.

[Figure 1 block schematic. IMS-T425: 32bit CPU, 4kB RAM, Timers, Link 0-3, System Services, Event, External Memory Interface, ext. Memory (dyn. FPM-RAM). T42-in-FPGA: 32bit CPU, 2...8kB RAM, 8...32kB Cache, Timers, Link 0-3, System Services, Event, DDR-RAM Controller, ext. Memory (DDR-SDRAM).]

Figure 1. Block Schematic - Comparison of IMS-T425 vs T42-in FPGA.

Our T42 is a step-by-step bottom-up design and almost every VHDL block has been rewritten several times. The reverse engineering approach required many cycles of:

• understanding instruction execution,
• reengineering of the underlying hardware architecture,
• microcode retrieval and verification.

For this reason CPU performance could not be a major design target, but where it was available as low-hanging fruit it was taken. Our idea was that the design could be improved later in student semester projects. Ultimately, many T42 cores should run on a cheap student FPGA board.


Our design preparation started with an in-depth analysis of the Inmos Simple-42 (S42) [6-8]. This formed the basis of our T42 evaluation stack, its bus connections and ALU control. The micro-word structure and signal names of the S42 gave us a starting point and have been reused as far as useful. A nice S42 summary prepared by Gavin Crate can be found on his website [11]. Additionally, several Inmos patents have been studied [16-21].

Necessary data-path extensions came from an in-depth analysis of the requirements of dedicated instructions [25-35, 37], for instance the byte-align principles for MOVE, Booth's MUL and non-restoring DIV algorithms, CRC, BITREV and IEEE-754 32-bit floating point support.

2.3 VHDL Implementation

A fine-grained VHDL hierarchy (see Figure 2) was chosen to allow modular and easy replacement of low-level functional building blocks during development and refinement. The 1st (bottom) level VHDL blocks contain all random and sequential logic which provides the functionality. The 2nd and 3rd level VHDL blocks, especially those with "path" in their name, are purely structural (i.e. connection) layers. The same is valid for all higher VHDL levels. On the 3rd level the T42 processor, system-control unit, local memory subsystem and links are composed.

[Figure 2 diagram, VHDL blocks per hierarchy element:]

• cpu_top (structural), b004_top (structural)
• ctrl2data (structural)
• ctrlpath (structural): constpkg, functpkg, ucode_rom, idecode, oreg, iptr, prefetch.
• datapath (structural): abcdereg, alu, wptr, pointers, constox, bytealign, pipectrl.
• mempath (structural): decoder, intarbiter, mem_intern, extarbiter, extpocadpt, readdatadist.
• syspath (structural): sysctrl, sbits, timer, sysservice
• linkpath (structural): synclogic, chanout, chanin, chanevent.
• linkif (library): serdes_os, ...
• memories (library): ucode_rom1024x128, intern_dpsram2kx32, extern_dpsram2kx32.
• PoC (library): fifo, cache_mem, ddr2_mem2mig.

Figure 2. T42 VHDL Hierarchy.

The 4th VHDL level can be used to create super-blocks, for instance to emulate an Inmos IMS-B004 board with external memory, or a larger Inmos IMS-B008-like board with more interconnected Transputers and individual external (distributed) memory per CPU. Both are shown as examples in Figure 3 and Figure 4.


[Figure 3 block diagram. T42 Core: Instruction Decode, Integer Execution Unit, Prefetch Unit, Memory Interface (Arbiter), System Control Unit (including Timers), Link Unit, 8kB internal Memory, ext. Event. FPGA-board: PoC Adaptor, Cache Controller, 32kB Cache Memory, DDR-SDRAM Controller, 128MB DDR2 Memory, four OS-SerDes blocks, each connected to a 10pin PMOD Connector.]

Figure 3. T42 Implementation Example, IMS-B004 like Configuration on FPGA-Board.

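[Figure 4 block diagram. XC7A100 FPGA containing four Transputers, T0 (root) with an "up" link and T1, T2, T3 with "down" links, plus an Arbiter and a DDR-RAM Controller connected to ext. Memory (DDR-SDRAM).]

Figure 4. T42 Implementation Example, small IMS-B008 like Configuration on FPGA-Board.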

From here onwards the students' imagination is called for, for instance to integrate additional functional blocks like a graphics subsystem, an Ethernet controller or an SD-card controller. The Inmos "LEGO"-like Transputer Modules (TRAMs) can be considered as examples. ☺

2.4 Processor Pipeline

To achieve a reasonable CPU performance some units of our T42 processor core should work simultaneously. Therefore an autonomous pre-fetch unit has been designed to complement our simple T42 processor pipeline, which consists of two stages only:

PF: an autonomous finite state machine (not part of the pipeline),
stage-1: IF/ID, instruction fetch & decode,
stage-2: EX, execution (one to multiple clocks).

Thereby a continuous overlap of instruction decode with execute is possible. Instructions may execute in one or multiple clocks (for an example see Figure 5). Note: the microcode signal NEXTACTION triggers the next instruction decode within the last execution clock of each instruction.

[Figure 5 timing diagram: instruction sequence LDL, LDNL, LDC, SUM, LDL, STNL over clocks 1 to 12. Each instruction shows an ID clock followed by one (EX) or two (EX1, EX2) execute clocks; the sequence contains one stall clock.]

Figure 5. Overlapped Instruction Decode and Execution in T42 Pipeline.

Memory read/write accesses happen within execute clocks. An internal memory read access requires 2 clocks (1st: address, 2nd: data). For an external memory read access the number of clocks depends on whether the cache hits (3 clocks; 1st: address, 2nd: tag access, 3rd: data) or misses (n clocks, depending on the DDR-SDRAM access time). Because the T42 microcode is written for memory read accesses with 2 clocks, any external memory read access requires a minimum of one pipeline stall clock until the data arrives. Memory write requests can be issued at any time and require only one (STL) or two (STNL) clocks.

2.5 Pre-fetch Unit

We decided to implement the instruction pre-fetch unit as a separate, autonomous FSM. Thereby the CPU instruction fetch becomes part of the instruction decode clock. A 2-word pre-fetch buffer provides 8 instructions (one per byte) in a queue. After the last instruction byte of a word buffer has been decoded, that word buffer is invalidated. A new (word) fetch cycle starts immediately, while instructions from the second word buffer are still being executed. Therefore a continuous instruction flow for execution is guaranteed.

If the instruction pointer is loaded with a new value (e.g. due to a JUMP, LEND or CALL), both word buffers are invalidated and the pipeline stalls until the next required instruction has been fetched.
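The clock accounting of the two-stage pipeline can be sketched in a few lines of Python. This is a simplified model under stated assumptions, not a description of the T42 hardware: the function names are hypothetical, the pre-fetch buffer is assumed never to run empty, and the execute-clock counts in the usage note are illustrative values.

```python
INTERNAL_READ_CLOCKS = 2   # 1st: address, 2nd: data
CACHE_HIT_READ_CLOCKS = 3  # 1st: address, 2nd: tag access, 3rd: data


def read_stall_clocks(external: bool, cache_hit: bool = True,
                      miss_clocks: int = 0) -> int:
    """Stall clocks beyond the 2-clock read that the microcode is written for."""
    if not external:
        return 0
    clocks = CACHE_HIT_READ_CLOCKS if cache_hit else miss_clocks
    return max(0, clocks - INTERNAL_READ_CLOCKS)


def total_clocks(program) -> int:
    """program: list of (execute_clocks, stall_clocks), one pair per instruction.

    The decode of each following instruction overlaps the last execute clock
    of its predecessor, so only the very first decode costs a clock of its own.
    """
    return 1 + sum(execute + stall for execute, stall in program)
```

For example, `total_clocks([(2, 0), (2, read_stall_clocks(True)), (1, 0), (1, 0), (2, 0), (2, 0)])` models six instructions of which the second reads external memory with a cache hit and therefore adds one stall clock.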
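The behaviour of the 2-word pre-fetch buffer can be mimicked by a small functional model. The Python sketch below is an assumption-laden illustration: the class `PrefetchBuffer` and its methods are hypothetical names, timing is not modelled (a fetch completes instantly), and an access to a word that is not buffered is simply treated like a new Iptr load.

```python
WORD_BYTES = 4


class PrefetchBuffer:
    """Functional (untimed) model of a 2-word instruction pre-fetch buffer."""

    def __init__(self, memory: bytes):
        self.memory = memory
        self.words = {}    # word address -> 4 instruction bytes, at most 2 entries
        self.fetches = 0   # number of word fetches issued so far

    def _fill(self, word_addr: int) -> None:
        if word_addr not in self.words and len(self.words) < 2:
            start = word_addr * WORD_BYTES
            self.words[word_addr] = self.memory[start:start + WORD_BYTES]
            self.fetches += 1

    def invalidate(self) -> None:
        """Iptr was loaded with a new value: both word buffers become invalid."""
        self.words.clear()

    def next_byte(self, iptr: int) -> int:
        word_addr, offset = divmod(iptr, WORD_BYTES)
        if word_addr not in self.words:
            self.invalidate()            # the real pipeline would stall here
        self._fill(word_addr)
        self._fill(word_addr + 1)        # refill the free word buffer
        byte = self.words[word_addr][offset]
        if offset == WORD_BYTES - 1:
            del self.words[word_addr]    # last byte decoded: invalidate this word
        return byte
```

Reading 8 consecutive instruction bytes from address 0 issues three word fetches in this model: two to fill the empty buffer and one as soon as the first word has been consumed.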

2.6 Microcode

Similar to the original Transputer, our T42 is controlled by a horizontal microcode control unit (sequencer, see Figure 6). A microcode-word contains many single bits or bit-groups (further called micro-operations) which directly control the behaviour of functional logic blocks (e.g. multiplexors) in the underlying hardware. Human-readable mnemonics are used to describe these micro-operations. Thus microcode is an effective abstraction layer which simplifies the description of pre-defined state changes (sequencing) within the data-path.

[Figure 6 block diagram. Labelled blocks: Instruction Decoder, Entry Address Mux, Microcode-ROM Address Decoder, Control Store (Micro Program), Micro Instruction Register, Microcode (Control) Word, ROMFEEDBAK, Condition Multiplexors, Condition Bits from Data Path, Data Path.]

Figure 6. Horizontal Microcode Control Unit (Sequencer) Principle.

In example 1 the primary instruction LDC (load constant) is described by its micro-operations. Each micro-operation controls a multiplexor and thereby the data flow from Oreg via Ybus and Zbus into Areg. Please note that all micro-operation signals listed within a pair of braces are simultaneously present in the same clock (i.e. their order does not matter):

    104; LDC n { Y_BUS_FROM_oreg;    -- constant from Oreg ...
                 A_REG_FROM_zbus;    -- ... into Areg
                 B_REG_FROM_areg;    -- push stack
                 C_REG_FROM_breg;    -- push stack
                 Z_ALU_FROM_ybus;    -- Ybus pass-through
                 NEXTACTION_1 };     -- end of instruction

In total there are over 40 micro-operations (signals) to be described per clock (see Appendix A). A so-called default micro-word contains the no-operation setting of every micro-operation. The microcode-assembler applies these defaults to all micro-operations that are not listed (not affected), which avoids repeating them individually within each instruction.

In example 2 the primary instruction CJ (conditional jump) is explained. Condition-multiplexor '0' is used to allow a microcode branch in the 2nd clock, depending on the result of a comparison initiated in the 1st clock. Here Areg is tested by a comparator named zeq0 (Zbus equals zero) for being zero (logical FALSE) or not:
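How a default micro-word saves repetition can be shown with a short Python sketch. It is not the authors' AWK assembler. The field list, the no-operation value "hold" and the parsing rule (split at `_FROM_`, otherwise at the last underscore) are assumptions made for illustration; only the mnemonics of the LDC listing are taken from the text.

```python
# Hypothetical default micro-word: every field at its no-operation setting.
# Only a handful of the 40+ fields are shown; names and defaults are assumed.
DEFAULT_MICRO_WORD = {
    "A_REG": "hold",
    "B_REG": "hold",
    "C_REG": "hold",
    "Y_BUS": "hold",
    "Z_ALU": "hold",
    "NEXTACTION": "0",
}


def parse_micro_op(mnemonic: str):
    """Split a mnemonic such as A_REG_FROM_zbus or NEXTACTION_1 into (field, value)."""
    if "_FROM_" in mnemonic:
        field, value = mnemonic.split("_FROM_", 1)
    else:
        field, value = mnemonic.rsplit("_", 1)
    return field, value


def assemble_micro_word(mnemonics, defaults=DEFAULT_MICRO_WORD):
    """Start from the defaults and override only the listed micro-operations."""
    word = dict(defaults)
    for mnemonic in mnemonics:
        field, value = parse_micro_op(mnemonic)
        if field not in word:
            raise KeyError(f"unknown micro-operation field: {field}")
        word[field] = value
    return word


LDC = ["Y_BUS_FROM_oreg", "A_REG_FROM_zbus", "B_REG_FROM_areg",
       "C_REG_FROM_breg", "Z_ALU_FROM_ybus", "NEXTACTION_1"]
```

Because all micro-operations of one micro-word act in the same clock, `assemble_micro_word(LDC)` and `assemble_micro_word(list(reversed(LDC)))` yield the same word.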

    10A; CJ    { X_BUS_FROM_areg;      -- false/true in Areg
                 CMUX0_FROM_zeq0;      -- '1' if Zbus=0
                 O_REG_FROM_oreg;      -- jump offset in Oreg
                 Z_ALU_FROM_xbus;      -- Xbus pass-through
                 ROMFEEDBAK_0x420 };   -- 2-fold symbolic branch

    10A; 0x420 { A_REG_FROM_breg;      -- pop stack (if true)
                 B_REG_FROM_creg;      -- pop stack (if true)
                 NEXTACTION_1 };       -- end of instruction

    10A; 0x421 { I_PTR_FROM_zbus;      -- load new Iptr (if false)
                 X_BUS_FROM_iptr;      -- old Iptr
                 Y_BUS_FROM_oreg;      -- jump offset
                 MADDR_MODE_byte;      -- ALU in byte mode
                 Z_ALU_FROM_xplusy;    -- add Iptr + Oreg
                 ROMFEEDBAK_0x0FF };   -- goto NOP

    0FF; NOP   { NEXTACTION_1 };       -- end of instruction

Here the 2-fold branch is described by the symbolic address 0x420, which is interpreted by the microcode-assembler as a placeholder. It is replaced by a physical address later, during ROMFEEDBAK address allocation. Please note that for CJ the branch to 0x420 ends with NEXTACTION whereas the branch to 0x421 does not! Here an additional clock is added by use of a NOP micro-word with entry address 0x0FF. This is a pipeline requirement, because loading Iptr immediately invalidates both pre-fetch buffers. The new Iptr becomes available for the new instruction fetch in the next clock (during NOP) at the earliest.

More complex instructions require up to a dozen micro-words (or even more). For the microcode-assembler any address value above 0x400 defines a symbolic address in strict order. Some predefined values (0x?20, 0x?40, 0x?60) are required for 2-fold and 4-fold branches; all other values are freely available to describe micro-word sequences (e.g. 0x401, 0x402, 0x403, 0x50?, 0x60?, 0x70?, etc.).

Any micro-word sequence can be generated by use of the 10-bit address in the micro-operation field ROMFEEDBAK. In case of an intended branch, bit 0 and/or bit 1 of ROMFEEDBAK is replaced, by means of two condition-multiplexors (CMUX0 and CMUX1), with the result of a condition evaluation. A 4-fold microcode branch can therefore implement loops with multiple break conditions.
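The address substitution performed by the condition-multiplexors can be written down as a small Python sketch. The function name and the physical base addresses used in the usage note are hypothetical (the symbolic address 0x420 is only a placeholder that the assembler replaces later); the bit replacement itself follows the description above.

```python
ADDR_MASK = 0x3FF  # 10-bit microcode-ROM address


def next_micro_address(romfeedbak: int, cmux0=None, cmux1=None) -> int:
    """Return the next microcode address.

    cmux0 / cmux1 are None when the corresponding condition-multiplexor is not
    used; otherwise their boolean result replaces bit 0 / bit 1 of ROMFEEDBAK.
    """
    addr = romfeedbak & ADDR_MASK
    if cmux0 is not None:
        addr = (addr & ~0x1) | (1 if cmux0 else 0)
    if cmux1 is not None:
        addr = (addr & ~0x2) | (2 if cmux1 else 0)
    return addr
```

With an assumed physical base of 0x220 for the CJ branch pair, `next_micro_address(0x220, cmux0=True)` selects the micro-word at 0x221 (Zbus was zero) and `next_micro_address(0x220, cmux0=False)` the one at 0x220. Using both multiplexors on an aligned base selects one of four successor micro-words.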

2.7 Microcode ROM

The microcode-ROM outputs one micro-word every clock. This micro-word controls the whole logic in the data-path (e.g. multiplexors), data-path related hardware (e.g. the condition-multiplexors in the control-path), status bits for system-control and dedicated hardware blocks (e.g. timers and link channels).

Our current microcode-ROM size is 1024x128 bit, of which about 70% is utilized. The biggest consumers are the 11 ALT instructions (85 micro-words), the 5 MOVE/MOVE2D instructions (48), DIV+REM (31), FMUL (18) and ROUNDSN (18). Scheduler routines (e.g. for queue handling) occupy about 200 micro-words. About 90% of the T42 microcode-ROM bits are zero.

Info: The original T425 microcode-ROM has a size of 742x118 bit (~90 kbit), and 90.41% of its bits are zero, as investigated by Gavin Crate in November 2016.


2.8 Microcode-Assembler

Our microcode-assembler is an AWK script which reads its setup directly from 3 files in text format:

1.) t42_cpu_constpkg.vhd contains all micro-operation definitions,
2.) t42cpu_ucode_rom.vhd contains the micro-word bit assignments,
3.) a *.csv file (Excel spreadsheet) describes the microcode itself; it contains the handwritten mnemonics of all necessary micro-operations per instruction.

Scripts 1.1 and 1.2 provide the configuration setup to the assembler. Scripts 2.1 to 2.7 form the multi-pass assembler; each sub-task script makes a full pass over the whole micro-word list. Scripts 3.1 and 3.2 perform the final format conversions. The last step 3.3 is done by a Xilinx tool which transforms the memory hex file into a VHDL constants (pkg) description suitable for memory configuration during simulation and synthesis.

1.1 PRE PROCESSING - READ BIT POSITIONS OF MICRO-OP's

1.2 PRE PROCESSING - READ MICRO-OP IDs AND CODING

2.1 PRIMARY PROCESSING - ASSEMBLE MICRO-WORDs

2.2 SECONDARY PROCESSING - SORT MICRO-WORD COLUMNs

2.3 SECONDARY PROCESSING - SORT MICRO-WORD ROWs

2.4 SECONDARY PROCESSING - ALLOCATE FIXED ADDRESSES

2.5 SECONDARY PROCESSING - TABULATE BRANCH CAPABILITIES

2.6 SECONDARY PROCESSING - ALLOCATE JUMP+BRANCH LABELS

2.7 SECONDARY PROCESSING - CALCULATE ROMFEEDBAK ADDR

3.1 POST PROCESSING - BUILD ROM (BINARY FORMAT)

3.2 POST PROCESSING - WRITE HEX ROM

3.3 POST PROCESSING - BUILD uCodeROM - CALL XILINX DATA2MEM

The generation of a new microcode-ROM takes less than a minute, so instruction debugging and re-simulation become easy.

2.9 Control-path

The control-path is responsible for continuous CPU operation during microcode execution. Microcode entry addresses are provided either by the microcode-ROM itself or by the system-control unit (for scheduler and event routines). The control-path components are:

• T42 constants definition package,
• T42 VHDL functions definition package,
• Microcode-ROM (1024x128 bit),
• Instruction decoder,
• Operand register (Oreg, temporary register),
• Instruction pointer (Iptr),
• Pre-fetch unit (with 2-word pre-fetch buffer).

2.10 Instruction Decoder

Transputer byte code allows quite easy instruction decoding. Each instruction byte consists of 2 nibbles. The higher nibble (the code nibble) describes the operation or, more precisely, the lower 4-bit part of the op-code entry address into the microcode-ROM. The higher 6-bit part is a fixed value. For the primary instructions 0x0 to 0xE the lower nibble (the data nibble) always carries an operand.


The common entry for all secondary instructions is the primary instruction 0xF, where the code nibble means OPERATE. For decoding secondary instructions the operand register is additionally required, and the accumulated operand is now interpreted as a 10-bit op-code entry address.
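The decoding scheme can be sketched in Python as follows. This is an illustration, not the T42 decoder: the fixed upper address part is assumed to correspond to a base of 0x100, inferred from the listing labels 104 (LDC) and 10A (CJ), and PFIX/NFIX are folded into the operand accumulation instead of being dispatched to their own micro-words. The function codes 0x2 (PFIX), 0x6 (NFIX) and 0xF (OPERATE) are those of the Transputer instruction set.

```python
PRIMARY_ENTRY_BASE = 0x100   # assumption, inferred from labels 104 (LDC), 10A (CJ)
PFIX, NFIX, OPR = 0x2, 0x6, 0xF
WORD_MASK = 0xFFFFFFFF


def decode(code_bytes):
    """Yield (kind, entry_address, operand) for each completed instruction."""
    oreg = 0
    for byte in code_bytes:
        code, data = byte >> 4, byte & 0xF
        oreg |= data                          # data nibble enters Oreg
        if code == PFIX:
            oreg = (oreg << 4) & WORD_MASK    # prefix: shift operand up
        elif code == NFIX:
            oreg = (~oreg << 4) & WORD_MASK   # negative prefix: complement, shift
        elif code == OPR:
            # secondary instruction: accumulated operand is the entry address
            yield ("secondary", oreg & 0x3FF, None)
            oreg = 0
        else:
            yield ("primary", PRIMARY_ENTRY_BASE | code, oreg)
            oreg = 0
```

For instance, the byte sequence 0x21 0x22 0x43 accumulates the operand 0x123 through two PFIX bytes before LDC (code nibble 0x4) is dispatched to entry address 0x104.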

2.11 Data Path

Areg, Breg and Creg form the evaluation stack (Figure 7), which is complemented by the 2 internal registers Dreg and Ereg. Together with Oreg (as temporary register) the two internal registers are used for all scheduler operations (note: these are always possible between any two instructions!), while Areg, Breg and Creg remain unchanged. During instruction execution all these CPU registers may be in use and their content can change concurrently every clock (e.g. shift operations are executed in the data path and not in the ALU).

[Figure 7 block diagram. Labelled blocks: Oreg, Areg, Breg, Creg, Dreg, Ereg, Sreg, Wptr, Iptr, Pointers (Front Pointer, Back Pointer), ConstBox, TmClkRegs, TmPreSet, Xbus Mux, Ybus Mux, ALU (Carry In, Carry Out), Byte-Align, Dout Bus Mux. Labelled buses: Xbus, Ybus, Zbus, Nbus, Dbus, Mbus, Data In Bus, Data Out Bus, CPU Address Bus (to Memory Path).]

Figure 7. T42 Data-Path (simplified) with Bus System.

All T42 internal data sources (registers, pointers, constants box, timer registers, status register) are connected to either Xbus or Ybus. Our very simple ALU is fed by Xbus and Ybus and produces Zbus as its result. Zbus is also the CPU address bus for memory access. Therefore the ALU has different operation modes for data calculation and for (displacement-based) address calculation. All of the above data sources are further connected to the byte-aligner, which provides the CPU data-out bus towards memory and receives the CPU data-in bus from memory. The byte-aligner performs all necessary part-word operations (including Load Byte, Store Byte and MOVE) and relays data between memory and registers. The data-path components are:

• Evaluation stack (Areg, Breg, Creg) plus internal registers (Dreg, Ereg),
• ALU with Xbus, Ybus, Zbus,
• Workspace pointer (Wptr),
• Pointer registers (Fptr and Bptr per priority),
• Constants box (32 words),
• Byte-aligner,
• Pipeline control unit.

The byte-aligner is responsible for all part-word data reads and writes, including the MOVE instructions. Therefore a 2-word data buffer is included here to allow any byte address to be a read source and/or a write target. For multi-byte transfers the minimum number of word reads and word writes is always used; the best case is 2 clocks per word for a continuous transfer (in internal RAM). Please note that continuing an interrupted MOVE requires either one or two extra reads (depending on whether the source and destination byte addresses are word addresses or not).
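The "minimum number of word reads and word writes" can be made concrete with a short calculation. The Python sketch below only counts the 32-bit words that a byte range touches, which is a lower bound on the word accesses of an uninterrupted transfer; it does not model the byte-aligner itself, and the function names are hypothetical.

```python
WORD_BYTES = 4


def words_touched(byte_addr: int, length: int) -> int:
    """Number of 32-bit words covered by `length` bytes starting at `byte_addr`."""
    if length <= 0:
        return 0
    first = byte_addr // WORD_BYTES
    last = (byte_addr + length - 1) // WORD_BYTES
    return last - first + 1


def move_word_accesses(src: int, dst: int, length: int):
    """Lower bound on (word reads, word writes) for an uninterrupted block move."""
    return words_touched(src, length), words_touched(dst, length)
```

An 8-byte move from byte address 1 to the word-aligned byte address 8 needs at least three word reads (the source straddles three words) but only two word writes: `move_word_accesses(1, 8, 8)` returns `(3, 2)`.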

2.12 System-Control

The system-control unit (Figure 8) decides at every clock which action the processor takes next. This can be (in order of action priority):

• starting the processor (after Reset),
• continuing instruction execution with subsequent clocks (NEXTACTION = ‘0’),
• Halt-on-Error (note: this causes an endless NOP loop),
• event handling,
• no operation (in case both pre-fetch buffers are empty), or
• decoding the next instruction (NEXTACTION = ‘1’).

The most important trigger signal is the micro-operation NEXTACTION, which is set to ‘1’ in the last clock of each instruction. Note: the execution of an instruction is still ongoing as long as the micro-word signal NEXTACTION is ‘0’.
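The priority order listed above can be written down as a simple software model. This is a sketch only: the function and argument names are hypothetical, and the real unit is a hardware priority encoder that selects a microcode entry address rather than returning a string.

```python
def next_action(reset: bool, nextaction: int, halt_on_error: bool,
                event_pending: bool, prefetch_empty: bool) -> str:
    """Pick the next processor action, highest priority first."""
    if reset:
        return "start_processor"
    if nextaction == 0:
        # current instruction still needs further clocks
        return "continue_instruction"
    if halt_on_error:
        return "halt_on_error"
    if event_pending:
        return "event_handling"
    if prefetch_empty:
        return "nop"
    return "decode_next_instruction"


if __name__ == "__main__":
    # an event arrives in the last clock of an instruction
    print(next_action(False, 1, False, True, False))
```

Because the "continue" case is checked before events, an event can only be taken at an instruction boundary in this model.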

For event handling an extra priority and synchronization logic is required to provide an appropriate microcode entry-address. Therefore the system-control unit supervises and determines the whole processor status. Some decoded instruction status information goes directly from the instruction decoder to the system-control unit. This prevents the interruption of PFIX and/or NFIX (operand accumulation) instructions, because they have to be considered as atomic operations together with their subsequent instruction. The system-control path contains the following components:

• system-control unit,
• status-register,
• timers,
• system services.

[Figure 8: block diagram of the system-control unit. Labelled elements: Priority Encoder (inputs include Reset, NEXTACTION, HLToE, NOP), Event-Sync-Logic (with Deco_PNfix_reg from the instruction decoder), Event-EAddr. ROM, Entry Address Mux, Micro-Code ROM, Micro-Instruction Register (MIR), Data-Path. Entry addresses at the mux: Start Processor eAddr. 0x1FF; current instruction’s follow-up eAddr.; Halt-on-Error eAddr. 0x0FF; Event eAddr.; NOP eAddr. 0x0FF; next instruction’s eAddr. from the instruction decoder.]

Figure 8. T42 System-Control Unit (simplified) with Priority Encoder.

The T42 processor status-register contains many more bits than are visible to the programmer. Additional bits reside here, for instance to remember which MOVE2D is ongoing, the quotient-sign for DIV, the dividend-sign for the REM instruction, and many control bits for the behaviour of timers and links.

System services support internal clock management and the pin interfaces to the outer world for receiving hardware signals like Reset, Analyse, Error and external Event.

2.13 Timer

The timer unit contains a (system-clock dependent) pre-scaler, a high-priority timer (1 µs tick), a low-priority timer (64 µs tick) and the time-slice support logic (1 ms). The timers are disabled after reset and can be enabled / disabled by processor instructions.
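How the pre-scaler value depends on the system clock can be sketched as follows. The paper does not describe the actual divider chain, so the structure below (one pre-scaler producing the 1 µs tick, from which the 64 µs tick and the 1 ms time slice are counted) is an assumption, and all names are hypothetical.

```python
def timer_dividers(sys_clk_hz: int) -> dict:
    """Divider values for a given system clock (sketch, integer MHz clocks only)."""
    if sys_clk_hz <= 0 or sys_clk_hz % 1_000_000 != 0:
        raise ValueError("sketch assumes a whole number of clocks per microsecond")
    clocks_per_us = sys_clk_hz // 1_000_000
    return {
        "clocks_per_high_tick": clocks_per_us,       # 1 us high-priority tick
        "high_ticks_per_low_tick": 64,               # 64 us low-priority tick
        "clocks_per_low_tick": clocks_per_us * 64,
        "high_ticks_per_timeslice": 1000,            # 1 ms time slice
    }


if __name__ == "__main__":
    for f_hz in (50_000_000, 60_000_000):
        print(f_hz, timer_dividers(f_hz))
```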

Note: the T42 design work still ongoing while writing this paper is focussing on the link path VHDL. Therefore the microcode supporting the timer and the timer hardware verification are not completed yet.

2.14 Memory Path

Altogether there are 6 read-address and 5 write-address sources, which access two memory spaces: the internal RAM and (via cache) the external DDR-SDRAM on the FPGA board:

• CPU: read, write
• Pre-fetch: read, -
• 4x in-channel DMA: - , write
• 4x out-channel DMA: read, -
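The totals of 6 read and 5 write address sources follow directly from the list above; a short tally (the data structure is for illustration only) makes the count explicit:

```python
# (name, number of instances, issues reads, issues writes), taken from the list above
SOURCES = [
    ("CPU", 1, True, True),
    ("pre-fetch", 1, True, False),
    ("in-channel DMA", 4, False, True),
    ("out-channel DMA", 4, True, False),
]


def count_address_sources(sources):
    """Return (number of read-address sources, number of write-address sources)."""
    reads = sum(count for _name, count, can_read, _can_write in sources if can_read)
    writes = sum(count for _name, count, _can_read, can_write in sources if can_write)
    return reads, writes


if __name__ == "__main__":
    print(count_address_sources(SOURCES))
```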

A memory arbiter handles all requests; concurrent access to internal and external memory is possible.

The 16-way associative unified cache answers all read requests to the external memory in order (FIFO), and write requests force a write-through. One outstanding read request per address source is allowed. This first version of our cache controller is a simple one: LRU (least recently used) is supported as cache strategy, but neither automatic pre-fetch nor write-back is implemented yet. Cache and DDR-SDRAM controller are connected by a 128bit bus, which is the size of one cache line. Thereby an optimum of memory performance can be achieved. Cache controller and DDR-SDRAM controller interface are part of the open source PoC library [5] from TU Dresden, Chair of VLSI-Design, Diagnostics and Architecture, and require PoC-specific interface signals. The whole Memory-Path consists of the following units:

• Address Decoder,
• internal Memory Arbiter,
• internal RAM (2kB … 8kB),
• external Memory Arbiter,
• external PoC interface,
• Read Data Distributor.

The following components are from PoC Library [5]:

• FIFO buffer,
• Cache memory (8kB … 32kB),
• Cache controller,
• DDR-SDRAM Controller interface.
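The cache parameters given above (16 ways, 128bit lines, 8kB to 32kB, LRU replacement, write-through) can be combined into a small behavioural model. This is a sketch and not the PoC cache controller: the class and method names are hypothetical, and the handling of write misses (no line allocation) is an assumption that the paper does not state.

```python
from collections import OrderedDict

LINE_BYTES = 16   # one 128-bit cache line (from the text)
WAYS = 16         # 16-way associative (from the text)


class LruWriteThroughCache:
    def __init__(self, capacity_bytes: int = 32 * 1024):
        self.num_sets = capacity_bytes // (LINE_BYTES * WAYS)
        # one ordered dict per set; the oldest entry is the least recently used
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.memory_reads = 0
        self.memory_writes = 0

    def _locate(self, addr: int):
        line = addr // LINE_BYTES
        return self.sets[line % self.num_sets], line // self.num_sets

    def read(self, addr: int) -> bool:
        """Return True on a hit; on a miss fetch the line and evict the LRU way."""
        ways, tag = self._locate(addr)
        if tag in ways:
            ways.move_to_end(tag)
            return True
        self.memory_reads += 1
        if len(ways) >= WAYS:
            ways.popitem(last=False)
        ways[tag] = True
        return False

    def write(self, addr: int) -> bool:
        """Write-through: memory is always written. Assumption: no allocation on a miss."""
        ways, tag = self._locate(addr)
        self.memory_writes += 1
        hit = tag in ways
        if hit:
            ways.move_to_end(tag)
        return hit


if __name__ == "__main__":
    for kib in (8, 32):
        print(kib, "kB ->", LruWriteThroughCache(kib * 1024).num_sets, "sets")
```

With these parameters a 32kB cache has 128 sets and an 8kB cache has 32 sets.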

The DDR-SDRAM controller will be customised and generated based on a Xilinx MCB (memory control block). The available external DDR-SDRAM type and capacity depend on the FPGA board.

2.15 Memory Bandwidth

On the original IMS-T425 memory bandwidth was well balanced, with about 50% for the CPU (~20% fetch, ~30% data) and ~15% for the 8 link-channel DMAs (i.e. ~35% reserve for CPU MOVEs or slower external memory).

For the T42 a memory bandwidth calculation looks different because the link speed is quite variable. For on-chip (FPGA internal) core-to-core link connections a byte-wide interface with handshake will be used. If a bandwidth of one byte per link-channel per clock were allowed, the links alone would demand twice the maximum available T42 core-to-memory bandwidth. Therefore either a bandwidth limitation for the links or a strong priority policy favouring CPU memory requests over the links is necessary. Off-FPGA-board serial link communication can be considered uncritical as long as it stays in the 20-40 Mbit/s range.
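The factor of two can be reproduced with a short calculation. The sketch assumes that the memory path delivers at most one 32bit word per clock; this figure is not stated explicitly in the text, but it is the value that makes 8 link channels at one byte per clock come out as twice the available bandwidth. The function names are hypothetical.

```python
WORD_BYTES = 4  # 32-bit memory word


def link_to_memory_ratio(link_channels: int = 8,
                         bytes_per_channel_per_clock: float = 1.0,
                         memory_words_per_clock: float = 1.0) -> float:
    """Link DMA demand divided by the available core-to-memory bandwidth."""
    demand = link_channels * bytes_per_channel_per_clock
    supply = memory_words_per_clock * WORD_BYTES      # assumption: 1 word per clock
    return demand / supply


def max_bytes_per_channel(share: float, link_channels: int = 8) -> float:
    """Per-channel byte rate (per clock) if the links may use `share` of the bandwidth."""
    return share * WORD_BYTES / link_channels


if __name__ == "__main__":
    print(link_to_memory_ratio())          # unrestricted byte-wide links
    print(max_bytes_per_channel(0.15))     # links limited to a 15% share, as on the IMS-T425
```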

2.16 Link Path

Figure 9 shows an overview of the Link Path. The 8 link-channel DMAs (4x in-channel, 4x out-channel) run concurrently. The Zbus is used for CPU commands downstream to the links (see TESTHARDCHAN [31]). All links together share the Wbus as their own address bus. The Ubus is shared for data upstream from the links to the CPU and for sending data to memory. A shared Lbus sends data from the memory to the links. A link-channel synchronization logic is required to manage all link-channel event signals based on link-channel priority. Two round-robin arbiters (one per priority level) handle link-channel DMA memory requests.
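The arbitration scheme with two round-robin arbiters can be modelled in a few lines. This is a sketch with hypothetical class names; in particular, the rule that a high-priority request always wins over a low-priority one is an assumption, since the text only states that there is one arbiter per priority level.

```python
class RoundRobinArbiter:
    def __init__(self, n: int):
        self.n = n
        self.last = n - 1          # so that requester 0 is checked first

    def grant(self, requests):
        """Grant the next requester after the previously granted one, or None."""
        for offset in range(1, self.n + 1):
            idx = (self.last + offset) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None


class LinkDmaArbiter:
    """One round-robin arbiter per priority level (assumption: high beats low)."""

    def __init__(self, channels: int = 8):
        self.high = RoundRobinArbiter(channels)
        self.low = RoundRobinArbiter(channels)

    def grant(self, high_requests, low_requests):
        idx = self.high.grant(high_requests)
        if idx is not None:
            return ("high", idx)
        idx = self.low.grant(low_requests)
        return ("low", idx) if idx is not None else None


if __name__ == "__main__":
    arb = RoundRobinArbiter(4)
    requests = [True, False, True, True]
    print([arb.grant(requests) for _ in range(4)])
```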

[Figure 9: block diagram of the link path. Labelled elements: Link 0 to Link 3, each with byte I/O and an optional OS-SerDes; Link Channels Memory Arbiter (requests & priority, to the memory path); Link Channels Sync Logic (channel event signals to the system-control unit; status signals to Cond.MUX); signals and busses from/to the 8x DMA: Zbus (data), Lbus (data), Wbus (address), Ubus (data), bus control & tag; control from the microcode ROM.]

Figure 9. T42 Link Path (simplified).

Each link-channel DMA consists of a 3-register stack which holds a data word, an address for memory access and a byte count. Input / output link-channel DMA behaviour is controlled by different state machines and microcode interaction [19]. Each link-channel DMA has either an in-byte or an out-byte interface (plus handshake signals) to the outer world. Here a neighbouring Transputer may be connected directly, or a serialiser / deserialiser for “off FPGA-board” serial link connection. The serial link speed is currently limited to 20 Mbit/s for compatibility reasons, to allow connection and tests with legacy hardware.

Note: the T42 design work still ongoing while writing this paper is focussing on the link path VHDL, which is not completed yet; neither is the microcode to operate the links.

3. Verification

A comprehensive design verification deserves as much time as the design itself. Several verification strategies have been used in parallel to adequately meet different verification requirements over the whole design time.

3.1 VHDL Test Benches

Simple test benches for almost every single VHDL block were used in the early stages of the design. Here microcode stimulation for the early data path was done “by hand”. Once an almost working CPU core became available, more and more hand-written assembler code snippets were used to verify dedicated hardware blocks in conjunction. This enabled the following to be verified:

• tb_01: ALU (data operations and address calculation)
• tb_02: data-path and tb_03: control-path (still separately, w/o microcode-ROM)
• tb_04: ctrl2data (only 8 instructions from pre-fetch buffer, still w/o pre-fetch FSM)
• tb_05: data-path (instructions like LEND, LADD … from on-chip RAM)
• tb_06: byte-aligner (instructions: MOVE, MOVE2Ds … from on-chip RAM)

After the design had become more complex, the TVS-1 (tb_07) regression test bench was used.

3.2 TVS-1 – Transputer Verification Suite

TVS-1 is a package of well-assorted assembler snippets, which are capable of verifying the execution results of 54 selected instructions versus their golden reference from a real IMS-T425. The original TVS-1 requires a host computer. The Transputer is booted over link and each of the individual instruction tests runs in isolation. The tested Transputer transfers the results via link back to the host computer, where the comparison versus the respective golden reference takes place. TVS-1 covers 54 instructions (only examples are listed; for a complete overview please see Appendix B):

• primary (3 of 15 available): ldc, adc, eqc,
• arithmetic/logic (16/17): add, gt, xor, …
• long arithmetic (9/9): ladd, ldiv, norm, …
• indexing (5/8): bsub, wsubdb, wcnt, …
• error handling (2/8): ccnt1, csub0,
• general (7/8): csngl, xword, pop, …
• CRC and bits (5/5): bitcnt, crcword, …
• floating point (5/6): unpack, postnorm, …
• ALTs (2/12): alt, talt.

Each IUT (instruction under test) requires up to 4 operands. Therefore the original TVS-1 contains a sample set of 128 predefined 32bit integer values including corner cases around minimal integer (MINT), zero and maximal integer, several random bit patterns, values with a single bit ‘1’ and ‘0’, and some small and large integers. 8 sample input files are required to meet all of the different IUT operand requirements for arithmetic & logic, shifts, range check, long arithmetic and floating point operations. The basic principle of TVS-1 for each individual test is as follows:

; load test (input from host via link-channel into workspace)
    ldl CREG
    ldl BREG
    ldl AREG
    __IUT__
    stl AREG
    stl BREG
    stl CREG
    testerr
    stl ERROR
; send result (output from workspace via link-channel to host)

The original TVS-1 provides altogether about 2,273,730 individual tests.
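The kind of corner-case values described for the sample set (values around MINT, zero and the maximal integer, plus patterns with a single ‘1’ or a single ‘0’ bit) can be generated as in the sketch below. It is not the actual TVS-1 sample set, which has 128 predefined values and also contains random patterns; the function name and the selection are assumptions for illustration.

```python
MASK32 = 0xFFFFFFFF
MINT = 0x80000000      # most negative 32-bit integer (bit pattern)
MAXINT = 0x7FFFFFFF    # most positive 32-bit integer


def corner_case_samples():
    """Return a duplicate-free list of 32-bit corner-case bit patterns."""
    samples = []
    for centre in (MINT, 0, MAXINT):
        for delta in (-1, 0, 1):
            samples.append((centre + delta) & MASK32)
    for bit in range(32):
        samples.append(1 << bit)                  # single '1' bit
        samples.append(~(1 << bit) & MASK32)      # single '0' bit
    return list(dict.fromkeys(samples))           # drop duplicates, keep order


if __name__ == "__main__":
    values = corner_case_samples()
    print(len(values), [hex(v) for v in values[:7]])
```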

Note: the T42 design work still ongoing while writing this paper is focussing on the link path VHDL. Therefore boot over link, which would be required to run the original TVS-1 with a T42 on an FPGA board at full clock speed, is not available yet.

Info: TVS-1 was written by Michael Bruestle in 2010 for the verification of various Transputer emulators.

3.3 TVS-1 as Regression Test Bench

For verifying design conformance versus a specification, the use of a regression test bench is the common approach. This test bench is run after each significant VHDL modification to show whether the design still meets the specification. In the case of our TVS-1 regression test bench the Transputer instruction set definition is the specification, and our T42 hardware should execute all 54 TVS-1 IUTs correctly.

An adaptation of the original TVS-1 (i.e. to run without links) was required for VHDL simulation. Furthermore the original TVS-1 basic sample size had to be reduced from 128 to 32 integer values to achieve suitable simulation run times. The TVS-1 regression test bench provides altogether about 96,834 individual tests. A full run with GHDL [77] as simulator takes about 1-2 hours.

The XILINX ISE development suite [74] allows loading of predefined RAM and ROM contents for simulation. IUT assembler code and sample set data are loaded into two different on-chip SRAM blocks before each IUT simulation run. During simulation the output data is compared word-by-word with the respective golden reference file from a real Transputer. Any difference stops the simulation and points to the test data set causing the simulation error.
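The word-by-word comparison against the golden reference can be sketched in software as follows. The function names are hypothetical, and the layout of four result words per data set (AREG, BREG, CREG, ERROR, following the store sequence of the TVS-1 test principle) is an assumption about the output format.

```python
WORDS_PER_DATA_SET = 4  # assumption: AREG, BREG, CREG and ERROR are stored per test


def first_mismatch(output_words, golden_words):
    """Return (index, got, expected) of the first differing word, or None."""
    for i, (got, expected) in enumerate(zip(output_words, golden_words)):
        if got != expected:
            return i, got, expected
    if len(output_words) != len(golden_words):
        # one of the two streams ended early
        return min(len(output_words), len(golden_words)), None, None
    return None


def failing_data_set(output_words, golden_words):
    """Index of the data set that contains the first mismatch, or None."""
    mismatch = first_mismatch(output_words, golden_words)
    return None if mismatch is None else mismatch[0] // WORDS_PER_DATA_SET


if __name__ == "__main__":
    golden = [1, 2, 3, 0, 5, 6, 7, 0]
    simulated = [1, 2, 3, 0, 5, 9, 7, 0]
    print(first_mismatch(simulated, golden), failing_data_set(simulated, golden))
```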

For our regression test bench the TVS-1 had to be wrapped into a script system for all the necessary memory-data and simulation-run preparations. The TVS-1 inner principle remains the same, whereas a more complicated assembler program (as wrapper) runs the nested loops (reading from 64 up to 4096 data sets) per IUT:

; load test program and data into memories
; start simulation (executing a big loop over many data sets per IUT)
    ldl CREG
    ldl BREG
    ldl AREG
    __IUT__
    stl AREG
    stl BREG
    stl CREG
    testerr
    stl ERROR
; test bench internal result comparison after each inner IUT loop
; continue outer loop until test data set is completed

To improve the T42 verification quality, three independent simulators have been used and their results have been compared for all TVS-1 verification runs:

• Xilinx iSim lite – part of Xilinx ISE design suite webpack edition [74],
• GHDL – an open source VHDL simulator [77],
• Mentor Graphics ModelSim PE student – part of ALTERA Quartus design suite lite edition [76]. Note: the PE student version is slowed down by a factor of 100 (for designs larger than 10000 lines) vs. the commercial full-speed version. P.S.: This already applies to our T42 design!

TVS-1 was very helpful to refine the T42 data-path while implementing and debugging some of the more complex instructions like MULs, DIVs, CRCs and Floating Point support including IEEE-754 round-to-nearest.

4. Synthesis Results

4.1 Target FPGA Boards

The following target FPGA-boards have been the first choice for tests with our T42:

• Digilent ATLYS [62] – tested ☺ OK.
  FPGA: XILINX XC6SLX45-3-CSG324C (Spartan-6) speed grade 3 (fast)
  Memory: 64Mx16bit DDR2 @ 800MHz = 128MB

• Digilent NEXYS-4-DDR [63] – tested ☺ OK.
  FPGA: XILINX XC7A100T-1-CSG324C (Artix-7) speed grade 1 (slow)
  Memory: 64Mx16bit DDR2 @ 667MHz = 128MB

• Digilent ARTY [64] – not tested yet
  FPGA: XILINX XC7A35T-1-CSG324C (Artix-7) speed grade 1 (slow)
  Memory: 128Mx16bit DDR3L @ 667MHz = 256MB

The promising Digilent ARTY as the 3rd board could not be tested yet. The board has a Raspberry-Pi form-factor plus shield connectors and is available for 105 EUR. Meanwhile there are two new ARTY versions on the market, one with a bigger Artix-7 XC7A100T-1-CSG324C FPGA for about 199 EUR and one with a Spartan-7 XC7S50-1-CSGA324C for about 115 EUR. Because similar 7-series FPGAs are used and the speed grade is the same as on the NEXYS-4-DDR, similar results should be expected.

At the end of chapter 4, Table 6 gives an overview of some more selected FPGA boards suitable for the T42.

4.2 T42 Synthesis as Embedded Core

Within the FPGA design flow there are always two different timing reports available, one after synthesis and another one after place & route. The timing is always calculated for worst-case operating conditions (i.e. maximum allowed ambient temperature and minimum VDD). Please note that all Spartan-6 FPGAs have speed grade 3 (fastest) but the Artix-7 FPGA XC7A100T-1 is speed grade 1 (slowest). For comparison a synthesis for a hypothetical XC7A100T-3 with speed grade 3 has been done as well.

First a “bare T42” synthesis, place & route as embedded core with 32kB internal RAM was performed (embedded configuration). Here the cache and DDR-SDRAM controller have been omitted. Only some FPGA-board resources like LEDs and 7-segment displays have been connected to memory-mapped registers. This configuration can be used later as a basis, for instance to connect many T42 cores.

The timing report after synthesis provides the theoretical limiting critical path of the design itself and an indication of the clock frequency range to be expected.

Table 1. Theoretical Timing Results after Synthesis (T42 Embedded Configuration), ISE 14.7

after synthesis          critical path   logic    logic    route    route    minimum       frequency
FPGA                     delay [ns]      [ns]     [%]      [ns]     [%]      period [ns]   [MHz]
Spartan-6: XC6SLX45-3    14.283          5.044    35.3%    9.239    64.7%    14.283        70.013
Artix-7: XC7A100T-1      12.320          6.550    53.2%    5.770    46.8%    12.320        81.169
Artix-7: XC7A100T-3      9.468           4.887    51.6%    4.581    48.4%    9.468         105.619

The timing report after place & route (implementation) provides the physically achievable clock frequency on the FPGA board, which is limited by the available FPGA routing resources, by a critical path caused by place and route problems, and by further I/O delays:

Table 2. Final Timing Results after Place & Route (T42 Embedded Configuration), ISE 14.7

after map & route        critical path   logic    logic    route     route    minimum       frequency
FPGA                     delay [ns]      [ns]     [%]      [ns]      [%]      period [ns]   [MHz]
Spartan-6: XC6SLX45-3    16.177          5.320    32.9%    10.857    67.1%    16.248        61.546
Artix-7: XC7A100T-1      14.662          5.900    40.2%    8.762     59.8%    14.660        68.213
Artix-7: XC7A100T-3      10.616          4.440    41.8%    6.176     58.2%    10.680        93.633

Discussion (Table 1 and Table 2):

1) The Xilinx ISE 14.7 timing report after synthesis shows almost the same critical path for both FPGAs on the Spartan-6 and Artix-7 target boards. The reason is our intentionally very long execute pipeline stage with signals traversing the whole data path.

2) Timing results for Spartan-6 and Artix-7 differ noticeably from synthesis to place & route. On Artix-7 the significant difference between speed grades becomes visible.
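The frequency column of the timing tables is simply the reciprocal of the minimum period. The sketch below recomputes it for the place & route results of Table 2 (values copied from the table, written with a decimal point); the function and variable names are illustrative.

```python
def fmax_mhz(min_period_ns: float) -> float:
    """Maximum clock frequency in MHz for a given minimum clock period in ns."""
    return 1000.0 / min_period_ns


# device: (minimum period [ns], reported frequency [MHz]) after place & route
TABLE_2 = {
    "XC6SLX45-3": (16.248, 61.546),
    "XC7A100T-1": (14.660, 68.213),
    "XC7A100T-3": (10.680, 93.633),
}

if __name__ == "__main__":
    for device, (period_ns, reported_mhz) in TABLE_2.items():
        print(f"{device}: {fmax_mhz(period_ns):.3f} MHz (reported {reported_mhz})")
```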

Note: Xilinx ISE 14.7 (last version from 2013) has been unmaintained for about 5 years and has not received the latest improvements and updates concerning Artix-7. This also includes necessary IPs like MCB and PLL, which are required for the DDR-SDRAM controller. We hope to do a trial with the latest 2018 version of the Xilinx Vivado design suite [75] soon. This will help to investigate whether better timing results are feasible.

4.3 T42 Synthesis with External DDR-SDRAM

Here the T42 was synthesised as a single core with 8kB local SRAM, cache and DDR-SDRAM controller (IMS-B004 configuration).

The timing report after synthesis provides the theoretical limiting critical path of the design itself and an indication of the clock frequency range to be expected.

Table 3. Theoretical Timing Results after Synthesis only (T42 in B004 Configuration), ISE 14.7

after synthesis          critical path   logic    logic    route     route    minimum       1) frequency
FPGA                     delay [ns]      [ns]     [%]      [ns]      [%]      period [ns]   [MHz]
Spartan-6: XC6SLX45-3    17.936          5.433    30.3%    12.503    69.7%    17.936        55.754
Artix-7: XC7A100T-1      14.905          6.720    45.1%    8.185     54.9%    14.905        67.092

The timing report after place & route (implementation) provides the physically achievable clock frequency on the FPGA board, which is limited by the available FPGA routing resources, by a critical path caused by place and route problems, and by further I/O delays:

Table 4. Final Timing Results after Place & Route (T42 in B004 Configuration), ISE 14.7

after map & route        critical path   logic    logic    route     route    minimum       1) frequency
FPGA                     delay [ns]      [ns]     [%]      [ns]      [%]      period [ns]   [MHz]
Spartan-6: XC6SLX45-3    16.445          2.452    14.9%    13.993    85.1%    16.585        60.295
Artix-7: XC7A100T-1      19.777          6.820    34.5%    12.957    65.5%    19.889        50.279

Discussion (Table 3 and Table 4):

1) The Xilinx ISE 14.7 timing report after synthesis shows almost the same critical path for both FPGAs on the Spartan-6 and Artix-7 target boards. The reason seems to be a memory request-to-grant signal path delay related to our PoC memory interface.

2) Timing results for Spartan-6 and Artix-7 differ noticeably from synthesis to place & route. On Spartan-6 the place & route optimisation has improved the final timing.

The T42 processor clock on the ATLYS board with XC6SLX45-3-CSG324C was set to 60 MHz. The processor clock is generated from the DDR2 memory clock by a variable integer divider, here: 300 MHz : 5 = 60 MHz. Note: the maximum allowed, theoretical DDR2 clock on ATLYS is 400 MHz (1.6 GB/s for one 16bit DDR-SDRAM).

The T42 processor clock on the NEXYS-4-DDR with XC7A100T-1-CSG324C was set to 50 MHz. The processor clock is generated from the DDR2 memory clock by a fixed integer divider, here: 200 MHz : 4 = 50 MHz, as currently required by the Xilinx Soft-MCB v1.9 (memory control block). Note: the maximum allowed, theoretical DDR2 clock on NEXYS-4 is 350 MHz (1.4 GB/s for one 16bit DDR-SDRAM).
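Both the processor clocks and the quoted peak bandwidths follow from simple arithmetic: the CPU clock is the DDR2 clock divided by an integer, and a 16bit double-data-rate interface transfers two 16bit words per memory clock. A sketch of the calculation (function names are illustrative):

```python
def cpu_clock_mhz(ddr_clock_mhz: float, divider: int) -> float:
    """Processor clock derived from the DDR memory clock by an integer divider."""
    return ddr_clock_mhz / divider


def ddr_peak_bandwidth_gb_s(ddr_clock_mhz: float, bus_bits: int = 16) -> float:
    """Theoretical peak bandwidth of a DDR interface in GB/s."""
    transfers_per_second = ddr_clock_mhz * 1e6 * 2     # two transfers per clock
    return transfers_per_second * (bus_bits / 8) / 1e9


if __name__ == "__main__":
    print("ATLYS:", cpu_clock_mhz(300, 5), "MHz,", ddr_peak_bandwidth_gb_s(400), "GB/s max")
    print("NEXYS-4-DDR:", cpu_clock_mhz(200, 4), "MHz,", ddr_peak_bandwidth_gb_s(350), "GB/s max")
```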

4.4 Testing the T42 on FPGA-Board Hardware

For several software tests on real hardware the T42 was synthesised as a single core with cache and DDR-SDRAM controller (IMS-B004 configuration). The allocated FPGA resources are reported in Table 5.

Table 5. FPGA utilization after Synthesis, Place & Route of one T42 Core (B004 Configuration)

B004 Configuration           Spartan-6 XC6SLX45               Artix-7 XC7A100T
(after Map & Route)          instances   LUT 1)    BRAM       instances   LUT 1)    BRAM
FPGA Capacity                            27228     116                    63400     270
T42 core                     1           1926      12         1           2071      12
T42 links (estimated)        1           1600                 1           1700
32kB Cache RAM               1                     16         1                     16
Cache-Controller (16 way)    1           4914                 1           4866
DDR/2/3 Controller           1           295                  1           3387
(multi bank capable)
sum                                      8735      28                     12024     28
% FPGA utilization                       32%       24%                    19%       10%
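The sum and utilization rows of Table 5 can be recomputed from the individual entries. In the sketch below the values are copied from the table, empty table cells are treated as 0 (an assumption), and the dictionary layout and names are chosen for illustration only.

```python
# per block: (LUTs, BRAMs); empty table cells are taken as 0
TABLE_5 = {
    "Spartan-6 XC6SLX45": {
        "capacity": (27228, 116),
        "blocks": {
            "T42 core": (1926, 12),
            "T42 links (estimated)": (1600, 0),
            "32kB cache RAM": (0, 16),
            "cache controller": (4914, 0),
            "DDR controller": (295, 0),
        },
    },
    "Artix-7 XC7A100T": {
        "capacity": (63400, 270),
        "blocks": {
            "T42 core": (2071, 12),
            "T42 links (estimated)": (1700, 0),
            "32kB cache RAM": (0, 16),
            "cache controller": (4866, 0),
            "DDR controller": (3387, 0),
        },
    },
}


def utilisation(entry):
    """Return (LUT sum, BRAM sum, LUT utilization %, BRAM utilization %)."""
    luts = sum(lut for lut, _bram in entry["blocks"].values())
    brams = sum(bram for _lut, bram in entry["blocks"].values())
    cap_luts, cap_brams = entry["capacity"]
    return luts, brams, round(100 * luts / cap_luts), round(100 * brams / cap_brams)


if __name__ == "__main__":
    for fpga, entry in TABLE_5.items():
        print(fpga, utilisation(entry))
```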

Each FPGA-board has up to 8 LEDs and sometimes a couple of 7-segment displays, which are directly accessible from the FPGA and useful for monitoring the running test code. The XILINX ISE design suite allows loading of predefined RAM and ROM contents together with the FPGA configuration. Therefore the T42 as a device under test (DUT) can execute quite large assembler snippets from the T42 internal 8kB SRAM.

• Test-1: Simple Address Walk
This tests the interface hardware between our T42 and the DDR-SDRAM on the target FPGA-board. A simple address walk test was written in 500 bytes of assembler code. The memory is 1.) cleared, 2.) written and 3.) read (verified) in blocks of 1 Mbyte. Health status is shown by 6 LEDs. The 3 lower LEDs stand for the above 3 steps of the program. The LED on/off status is toggled every 1 MB, and a successful finish switches the respective LED permanently on. The program is repeated in an endless loop, or, if a read fail occurs, the test stops and switches the 2 highest LEDs on to represent “fail”. Where available, the actual (or failing) block address is written to the 7-segment displays.
Results: Program runtime at 50 and 60 MHz is about 25+ seconds for one full pass.
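The three passes of the address walk (clear, write, verify, block by block) can be sketched against a software memory model. The original test is written in Transputer assembler and its data pattern is not given in the paper; storing each word's own byte address is an assumption, as are all class and function names. The faulty memory model exists only to show how a failing block is reported.

```python
class Ram:
    """Trivial word-addressed memory model with undefined initial content."""

    def __init__(self, size_words: int):
        self.words = [0xDEADBEEF] * size_words

    def write(self, index: int, value: int) -> None:
        self.words[index] = value

    def read(self, index: int) -> int:
        return self.words[index]


class StuckBitRam(Ram):
    """Faulty model: data bit 0 of one word always reads as '1'."""

    def __init__(self, size_words: int, bad_index: int):
        super().__init__(size_words)
        self.bad_index = bad_index

    def read(self, index: int) -> int:
        value = super().read(index)
        return value | 1 if index == self.bad_index else value


def address_walk(ram, size_words: int, block_words: int):
    """Return None on success, else the byte address of the first failing block."""
    for phase in ("clear", "write", "verify"):
        for base in range(0, size_words, block_words):
            for i in range(base, min(base + block_words, size_words)):
                if phase == "clear":
                    ram.write(i, 0)
                elif phase == "write":
                    ram.write(i, i * 4)        # assumption: pattern = own byte address
                elif ram.read(i) != i * 4:
                    return base * 4
    return None


if __name__ == "__main__":
    print(address_walk(Ram(1024), 1024, 256))
    print(address_walk(StuckBitRam(1024, bad_index=700), 1024, 256))
```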

Note: the T42 design work still ongoing while writing this paper is focussing on the link path VHDL. Therefore the tests described below are planned, but could not be carried out yet.

• Test-2: Boot over Link
The T42 link hardware should be tested at 20 MHz using a host PC and a legacy IMS-B008 board populated with an IMS-B426 TRAM. Link-0 of the T42 will be wired to one PMOD connector of the FPGA board. A voltage level converter (PMOD adapter board) will be used to interface the 5V legacy board to the 3.3V (or 2.5V) T42-in-FPGA I/O pins.
Results: still N/A.

• Test-3: Running Transputer Software
Once boot over link is available, Transputer software could be loaded easily and tested for compatibility with our T42 design. This would allow us to run original Transputer software that is still available, like the Inmos occam-2 native compiler [40].
Results: still N/A.

4.5 More T42 in FPGA

The T42 was targeted at loading many interconnected T42 cores into any suitably sized FPGA. As can be seen from Table 5, the design footprint of cache and DDR-SDRAM controller is quite large, especially on Xilinx 7-Series devices, where no hardware MCB (memory control block) is available and the controller has to be synthesised in LUTs. The LUT ratio of T42 core to cache to SDRAM controller is currently about 1:1:1. Possibly a smaller, size-optimised cache can be made available in future.

Table 6 lists selected FPGA boards, some of which have an academic price option. Most interesting are the cheap boards with an XC7A100T FPGA, because up to 16 bare T42 cores would fit in, and with cache and DDR-SDRAM controller still 6 full T42 cores w/ cache may be connected together, including a still necessary (distributed!) memory arbiter to share the on-board DDR-SDRAM.

The last two lines of Table 6 show the number of T42 cores which can be installed on each FPGA board for our above configurations. When omitting cache and DDR-SDRAM controller, the number of bare T42 cores per FPGA can be increased significantly, as can be seen from the bottom line.

Table 6. Overview of some selected XILINX FPGA Boards for Education

FPGA Boards            S3E Starter    S6 MicroBoard  NEXYS-3        ATLYS          NEXYS-4 DDR    ARTY A7        GENESYS
(Xilinx FPGA)
Vendor                 Digilent       Avnet          Digilent       Digilent       Digilent       Digilent       Digilent
Price                  150USD         89USD          270USD         (300USD)       275EUR*)       105EUR         1109EUR*)
Generation             Spartan3E      Spartan6       Spartan6       Spartan6       Artix7         Artix7         Kintex7
Tech-Node              90nm           45nm           45nm           45nm           28nm           28nm           28nm
internal speed         250MHz         500MHz         500MHz         500MHz         450MHz         450MHz         450MHz+
FPGA                   XC3S500E       XC6SLX9        XC6SLX16       XC6SLX45       XC7A100T       XC7A35T        XC7K325T
LUTs                   4456           5720           9112           27288          63400          20800          203800
BRAM 18Kb              20             32             32             116            270            100            900
DSPs                   20             16             32             58             240            90             840
ext. clock             50MHz          100MHz         100MHz         100MHz         100MHz         100MHz         200MHz
DDR-RAM                32Mx16bit      32Mx16bit      8MBx16bit      64Mx16bit      64Mx16bit      128Mx16bit     256Mx32bit
                       DDR-200        LPDDR2         cellularRAM    DDR2-800       DDR2-666       DDR3L          DDR3-1600
DDR clock              100MHz+        200MHz         80MHz          400MHz         325MHz         325MHz         800MHz
max. Mem. Bandwidth    0.4GB/s        0.8GB/s        pipelined      1.6GB/s        1.3GB/s        1.3GB/s        6.4GB/s
full 1) T42 fit-in     t.b.d.         t.b.d.         1.1            3.2            6.9            2.0            23.2
bare 2) T42 fit-in     1.3            1.6            2.6            7.7            16.8           5.5            54.0

Note: all price values are exclusive of VAT

Remarks:
*) academic board is available for 50% of the price
1) full T42 with cache and DDR-SDRAM controller (B004 configuration)
2) bare T42 w/o cache and DDR-SDRAM controller, but extra SRAM (embedded core configuration)
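The paper does not state how the fit-in numbers of Table 6 were calculated. A LUT-only estimate based on the footprints of Table 5 reproduces the listed values for the ATLYS, NEXYS-4 DDR, ARTY and GENESYS boards, so the following sketch may be close to the intended calculation. It rests on assumptions: a bare core is counted as T42 core plus (estimated) links, a full core adds one cache controller per core, all cores share a single DDR-SDRAM controller, the Artix-7 footprint is reused for the Kintex-7 device, and BRAM and routing limits are ignored. All names are hypothetical.

```python
# LUT footprints taken from Table 5
FOOTPRINT = {
    "Spartan-6": {"core": 1926, "links": 1600, "cache_ctrl": 4914, "ddr_ctrl": 295},
    "Artix-7": {"core": 2071, "links": 1700, "cache_ctrl": 4866, "ddr_ctrl": 3387},
}


def bare_fit(fpga_luts: int, family: str) -> float:
    """Number of bare T42 cores (core + links) that fit by LUT count alone."""
    f = FOOTPRINT[family]
    return fpga_luts / (f["core"] + f["links"])


def full_fit(fpga_luts: int, family: str) -> float:
    """Number of full T42 cores; assumption: one DDR controller shared by all cores."""
    f = FOOTPRINT[family]
    per_core = f["core"] + f["links"] + f["cache_ctrl"]
    return (fpga_luts - f["ddr_ctrl"]) / per_core


if __name__ == "__main__":
    boards = {
        "ATLYS": (27288, "Spartan-6"),
        "NEXYS-4 DDR": (63400, "Artix-7"),
        "ARTY": (20800, "Artix-7"),
        "GENESYS": (203800, "Artix-7"),   # Kintex-7, Artix-7 footprint assumed
    }
    for name, (luts, family) in boards.items():
        print(f"{name}: full {full_fit(luts, family):.1f}, bare {bare_fit(luts, family):.1f}")
```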

4.6 T42 synthesised as embedded Core in a CMOS Technology

We always believed that a Transputer VHDL could be beneficial for teaching computer architecture. During our still ongoing T42 project, two student projects were derived in 2016 and 2017, using an intermediate processor design status (with partial instruction set only): a semester paper [50] about verification and a bachelor thesis [49] about layout synthesis. We have learned that our T42 embedded core with 16kB of DPSRAM (still w/o links) could be adapted and synthesised in a state-of-the-art 130nm CMOS technology in about 3-4 months. The theoretical (non-optimised) size on silicon (i.e. core only w/o pads) would be about 0.480 mm² plus 0.925 mm² for 2x 8kB dual-port SRAM. If only a single 4kB SPSRAM had been used, the overall core size would be less than 1 mm².

5. Related Designs

Since STM discontinued production of the Transputer around the year 2000, several Inmos IMS-T425 compatible Transputer implementations as soft-core in FPGA have appeared. As far as we know, none of these designs has been made publicly available yet. A comparison of all designs related to our T42 in FPGA can be found in Table 7.

5.1 TPCORE (2004-2009)

TPCORE was designed at Tokyo Metropolitan University, Japan, and first reported at CPA-2004 [55].

TPCORE executes 96 (of 134) instructions from the IMS-T425 instruction set and was implemented in Verilog and VHDL. Four OS-links are available. Clock frequencies up to 24 MHz on a Xilinx Virtex-II FPGA were reported in 2004.

TPCORE-2 is an improved version and executes (most probably) the full Inmos T425 instruction set. Besides OS-links the core has an optional VCP (virtual channel processor) as well as DS-Links. Clock frequencies up to 50 MHz on Xilinx Virtex-4 were reported in 2009. 16 TPCORE-2 cores can be implemented together with an IEEE-1355 compatible SpaceWire router in a single Xilinx XC4V-LX160 FPGA [56, 57]. TPCORE has no external (DDR-SDRAM) memory connection.

5.2 Open Transputer (2014)

Open Transputer was designed in a bachelor thesis at Bristol University, U.K., and first reported at CPA-2014 [58-60].

The Open Transputer executes 67 (of 134) instructions from the IMS-T425 instruction set and was implemented in Verilog. A clock frequency up to 41 MHz has been achieved on a ZedBoard with XC7Z020-1-CLG484C FPGA (speed grade 1).

For the Open Transputer only one single (byte-parallel) custom link per core was implemented. This (byte-wide) link can be routed through a Benes network to connect other Open Transputers. A special, custom channel protocol (which is not Inmos compatible) has been developed for this purpose. The Open Transputer has no external (DDR-SDRAM) memory connection.

5.3 T42 Performance

When looking at real performance numbers (Table 8), for instance instructions per clock, neither the historical IMS-T425 nor our T42-in-FPGA can compete with today's state-of-the-art (mostly RISC) CPUs. Our T42 has only a very simple two-stage pipeline, while modern RISC CPUs have a 5- or 6-stage pipeline, even as soft-cores in FPGA. From our point of view, the value of a T42-in-FPGA lies more in its possibilities for education and exploration. The big advantage of the T42 is its ability to interconnect more than two CPUs; the only limit is the FPGA size.

Nevertheless, compared with other soft-cores in FPGA, our T42 is not too bad. A single T42 core (IMS-B004 configuration) has already achieved 60 MHz on a Spartan-6 (speed grade 3) ATLYS board and currently runs at 50 MHz on an Artix-7 (speed grade 1) NEXYS-4-DDR board.

Within the last few years, new (commercial) soft-cores have become freely available for education. We have selected three of those which are advertised specifically for our NEXYS-4-DDR board, together with their stated clock frequencies. These modern RISC CPUs have a configurable hardware environment incl. MMU, FPU, caches, I/O peripherals and a full software eco-system (compilers plus Linux as operating system):

• Gaisler LEON-3 (SPARC V8), 7-stage pipeline, CPU clock 50 MHz [71],
• MIPSfpga (MIPS), 5-stage pipeline, CPU clock 50 MHz [72],
• SiFive E310 (RISC-V), 5-stage pipeline, CPU clock 65 MHz [73].

Table 8 gives a selected overview of soft-core CPUs directly supported by FPGA vendors [67-70]. PicoBlaze is a tiny 8-bit CPU optimised for minimum LUT footprint on Xilinx FPGAs. Each instruction takes a deterministic 2 clock cycles.

MicroBlaze and NIOS-II are full-blown 32-bit RISC CPUs with a configurable hardware environment incl. MMU, FPU, caches, I/O peripherals, a full software eco-system and Linux. Different CPU flavors are available, including an SMP dual-core option. The FPGA vendors have heavily optimised their soft-core CPUs to maximise performance.

Table 7. Comparison of IMS-T425 compatible Soft Cores in FPGA

| Design | Xilinx Family | Tech Node | FPGA | Pipeline | Max. Clock | LUTs | T425 Instructions |
|---|---|---|---|---|---|---|---|
| TPCORE | Virtex-II | 130nm | XC2V2000 | 2-stage ? | 24MHz 1) | ~13000 2) | 96 of 134 |
| TPCORE-2 | Virtex-4 | 90nm | XC4VLX160 | 2-stage ? | 50MHz | ~8000 3) | 134 t.b.d. |
| Open Transputer | Artix-7 | 28nm | XC7Z020-1 | 2-stage ? | 41MHz | 9294 4) | 67 of 134 |
| T42 | Spartan-6 | 45nm | XC6SLX45-3 | 2-stage: ID,EX | 60MHz 5) | < 4000 6) | 134 |
| T42 | Artix-7 | 28nm | XC7A100T-1 | 2-stage: ID,EX | 50MHz 7) | < 4000 6) | 134 |

Note: FPGAs from Spartan-6 have speed grade 3 (fast), FPGAs from Artix-7 have speed grade 1 (slow).

Remarks:

1) max. 31.5 MHz

2) estimated: one TPcore uses 64% of 21504 LUTs in one XC2V2000 FPGA (2 LUTs per Slice, 4 Slices per CLB)

3) estimated: one 16x16 SpaceWire router plus 16 TPcores will fit in 135168 LUTs of one XC4VLX160 FPGA

4) 5233 LUTs for Open Transputer CPU plus 4061 LUTs for one autonomous Link with Benes Network

5) max. 60.295 MHz for worst case operating conditions

6) 1926-2147 LUTs for CPU, 1600-1700 LUTs for 4 Links

7) max. 50.279 MHz for worst case operating conditions
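The LUT figures in the remarks above can be re-derived with a few lines of arithmetic. The following is only a back-of-envelope sketch: all input numbers are copied from remarks 2), 3), 4) and 6), while the helper name `percent_of` is purely illustrative.

```python
# Back-of-envelope check of the LUT figures quoted in the Table 7 remarks.
# Input numbers come from the remarks; names are illustrative only.

def percent_of(total_luts: int, percent: float) -> float:
    """Return the given percentage of a LUT budget."""
    return total_luts * percent / 100.0

# Remark 2: one TPCORE uses 64% of the 21504 LUTs of an XC2V2000.
tpcore_luts = percent_of(21504, 64)

# Remark 3: upper bound per core if 16 cores (plus router) share 135168 LUTs.
tpcore2_upper_bound = 135168 / 16

# Remark 4: Open Transputer CPU plus one link with Benes network.
open_transputer_luts = 5233 + 4061

# Remark 6: T42 CPU plus four links (lower and upper end of the ranges).
t42_min = 1926 + 1600
t42_max = 2147 + 1700

print(round(tpcore_luts), tpcore2_upper_bound, open_transputer_luts, t42_min, t42_max)
```

The per-core value for TPCORE-2 is an upper bound only, because the SpaceWire router shares the same LUT budget.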


Table 8a-c. Overview of selected FPGA-Vendor supported Soft-Cores in FPGA [67-70]

| Xilinx PicoBlaze | Tech Node | Pipeline | Fmax | DMIPS |
|---|---|---|---|---|
| Spartan-3 | 90nm | N/A | 120MHz | 60 |
| Spartan-6 | 45nm | N/A | 128MHz | 64 |
| Artix-7 | 28nm | N/A | | |
| Virtex-7 | 28nm | N/A | 232MHz | 116 |

LUTs (PicoBlaze): from min. 52+, depending on peripherals

| Xilinx MicroBlaze | Tech Node | Pipeline | Fmax | DMIPS |
|---|---|---|---|---|
| Spartan-3 | 90nm | 3-stage | 80MHz | 65 |
| Spartan-6 | 45nm | 3-stage | 195MHz | |
| Artix-7 | 28nm | 3 or 5-stage | 153-226MHz | 164...303 |
| Virtex-7 | 28nm | 3 or 5-stage | 217-329MHz | 232...441 |

LUTs (MicroBlaze): 742...4100+, depending on pipeline and peripherals

| Altera NIOS-II | Tech Node | Pipeline | Fmax | DMIPS |
|---|---|---|---|---|
| Cyclone-II | 90nm | NIOS: 3-stage | 50MHz | |
| Cyclone-IV | 45nm | 5 or 6-stage | 150MHz | 170 |
| Cyclone-V | 28nm | 5 or 6-stage | 170-220MHz | 192...230 |
| Stratix-V | 28nm | 5 or 6-stage | 330-350MHz | 385...400 |

LUTs (NIOS-II): 1200...4600+, depending on pipeline and peripherals

Note: The original IMS-T425 with a 20 MHz clock was advertised by Inmos as having 10 MIPS.
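To compare cores across very different clock frequencies, the figures can be normalised to throughput per MHz. The sketch below uses a handful of (Fmax, DMIPS) pairs as read from Table 8 plus the advertised IMS-T425 figure from the note. The row selection and the function name are illustrative, and the Inmos MIPS figure is not a Dhrystone result, so the comparison is indicative only.

```python
# (core, Fmax in MHz, DMIPS or MIPS) pairs as read from Table 8 and its note.
ROWS = [
    ("PicoBlaze / Spartan-3", 120, 60),
    ("PicoBlaze / Virtex-7", 232, 116),
    ("MicroBlaze / Spartan-3", 80, 65),
    ("NIOS-II / Cyclone-IV", 150, 170),
    ("IMS-T425 (advertised MIPS)", 20, 10),
]

def mips_per_mhz(fmax_mhz: float, mips: float) -> float:
    """Throughput normalised to the clock frequency."""
    return mips / fmax_mhz

for name, fmax, mips in ROWS:
    print(f"{name:28s} {mips_per_mhz(fmax, mips):.2f} per MHz")
```

The PicoBlaze rows come out at 0.5 per MHz, which matches its fixed two clocks per instruction; the advertised IMS-T425 figure corresponds to the same ratio.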

5.4 T42 Perspectives

Currently the T42 clock frequency is limited by its originally intended two-stage pipeline. The execute pipeline stage (data path), with its many levels of logic and heavily routed buses, determines the most critical path. In contrast, the instruction decode stage has much less than half of this delay. Further investigation is needed to optimise the various critical paths in the current T42 design.

A small improvement in CPU execution performance of up to approx. 20% would be possible if instruction folding for a single PFIX instruction were implemented. For this, a dedicated second instruction decoder circuit would have to be added to the IF/ID stage, and the pre-fetch and Iptr logic would have to be modified.
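The effect of folding a single PFIX can be illustrated with a very simple cycle model. This is a sketch under stated assumptions, not a description of the T42 microarchitecture: it assumes that the cycle counts in Appendix B exclude the prefix bytes, that each `pfix` costs one cycle on its own, and that a folded `pfix` costs nothing. The function name and the example sequence are hypothetical.

```python
def total_cycles(sequence, fold_one_pfix=False):
    """Sum processor cycles for a list of (mnemonic, cycles) pairs.

    With fold_one_pfix=True, a single pfix directly in front of a
    non-prefix instruction is assumed to cost no cycle of its own.
    """
    total = 0
    previous = None
    for mnemonic, cycles in sequence:
        total += cycles
        if fold_one_pfix and previous == "pfix" and mnemonic not in ("pfix", "nfix"):
            total -= 1  # the folded pfix no longer needs its own cycle
        previous = mnemonic
    return total

# ldl; ldl; and; stl -- 'and' has memory code 24F6, i.e. pfix followed by opr.
# T42 cycle counts taken from Tables B.1 and B.2.
SEQUENCE = [("ldl", 2), ("ldl", 2), ("pfix", 1), ("and", 1), ("stl", 1)]

print(total_cycles(SEQUENCE), total_cycles(SEQUENCE, fold_one_pfix=True))
```

In this toy sequence one of seven cycles is saved. The actual gain depends on how often prefixed instructions occur in real code.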

The T42 VHDL and microcode were written (as far as possible) independently of the processor word length. After a few modifications, a 16-bit, a 32-bit or a 64-bit Transputer may be synthesized.

For a smaller block RAM footprint in the FPGA, the T42 microcode could be compressed once it has been frozen. This would reduce the number of block RAMs used, but requires additional logic to decode the individual micro-operations. Some impact on clock performance should therefore be expected.

The T42 design is already a “latent” Harvard architecture with separate buses for instruction and data. If a higher instruction throughput is required in future, this option can be activated easily by connecting separate caches.

For a radical improvement of the T42 clock performance, a modified micro-architecture is required (e.g. based on deeper pipelining), as was done for the ST20 and its successor designs [33, 48]. Further potential performance improvements (e.g. based on local workspace caching, advanced instruction folding and parallelism in instruction execution) are described in the literature [45, 47, 48]. These measures are potentially applicable to complex future T42 successor designs, but may require a larger design team.

Another beneficial direction for further development of the T42 design could be the addition of an IMS-T805 compatible floating point unit. Such a “T81” design should be feasible for a small team with adequate community support within a couple of years.

6. Summary

Our T42 open-source Transputer-in-FPGA is fully binary-code compatible with the Inmos IMS-T425, the 32-bit version of the Transputer. Our self-developed microcode will cover the full instruction set. Any still available Transputer software should run. Support for many cores in an FPGA was intended from the beginning, and quite a large number of interlinked T42 cores can be loaded onto suitably sized FPGA boards. For external memory extension, state-of-the-art DDR-SDRAM can be connected to the T42 core. Thus all initial design targets have been achieved.

The T42 is mainly targeted at education and exploration. The construction and evaluation of parallel computing networks based on the Transputer's distributed memory architecture is possible (e.g. within a student semester project). The CSP-like programming model of occam can be evaluated and compared to other programming languages.

Performance improvements are possible for potential future T42 successor designs. A step-by-step approach may be suitable, for instance starting with the introduction of instruction folding. Another meaningful target may be the addition of an IMS-T805 compatible FPU.

Several topics remain on our T42 to-do list until the end of 2018: the design of the link channels is still in progress, several scheduler-related microcode routines have to be written, verification of the timer is pending, the ability to run available original Transputer software on the T42 has still to be proven, our T42 open-source VHDL has to be published on the web (www.transputer.eu), and documentation (e.g. a T42 user manual) has to be provided.

Acknowledgements

We would like to express our gratitude to the whole staff, including all research students, of the Chair of VLSI Design, Diagnostics and Architecture at the Faculty of Computer Science, Technische Universitaet Dresden. There was always encouragement and support during the five years of design work on the T42. Special thanks are owed to Dr. Steffen Koehler and Patrick Lehmann for their valuable discussions about the Transputer instruction set, dedicated design details and verification topics, to Dr. Thomas Preusser for his comprehensive knowledge of computer arithmetic, and finally to Prof. Dr. Rainer G. Spallek for his overall courtesy towards this project.


References

[1] Martin Zabel, Thomas B. Preußer, Peter Reichel, Rainer G. Spallek, SHAP - Secure Hardware Agent Platform. Institute of Computer Engineering, Technische Universität Dresden, Germany, 2007. http://shap.inf.tu-dresden.de
[2] Martin Zabel, Rainer G. Spallek, Application requirements and efficiency of embedded Java bytecode multi-cores. In: JTRES'10: Proceedings of the 8th International Workshop on Java Technologies for Real-Time and Embedded Systems, ACM Press, 2010.
[3] Martin Zabel, Effiziente Mehrkernarchitektur für eingebettete Java-Bytecode-Prozessoren. Technische Universität Dresden, Dissertation, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-84156
[4] Thomas Preußer, Increasing the Performance and Predictability of the Code Execution on an Embedded Java Platform. TUD, Dissertation, 2011. http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-77425
[5] PoC “Pile of Cores” hardware functions library. Copyright 2007-2016 TU Dresden, Chair of VLSI-Design, Diagnostics and Architecture. https://github.com/VLSI-EDA/PoC PoC Docu-Website: http://poc-library.readthedocs.io/en/release
[6] David May, Simple 42 Instruction set. http://www.cs.bris.ac.uk/~dave/transputer.html
[7] David May, Simple 42 Microarchitecture. 11.Mar.1982. At David May's Transputer Page.
[8] David May, Simple 42 Microcode in VBC. At David May's Transputer Page.
[9] David May, T414 High level microcode. At David May's Transputer Page.
[10] Roger Shepherd, T425 Transputer Instruction Set Specification. 27.Sep.1988.
[11] Gavin Crate, Transputer Emulator, Inmos S42. https://sites.google.com/site/transputeremulator
[12] C.A.R. Hoare, Communicating Sequential Processes, ISBN: 0-131-53289-8, 21.Jun.2004. http://usingcsp.com/cspbook
[13] D. Pountain, D. May, A tutorial introduction to occam programming, ISBN: 0-632-01847-X, 1987.
[14] G. Jones, M. Goldsmith, Programming in occam 2, ISBN: 0-13-730334-3, Prentice Hall (UK), 1988.
[15] 72-OCC-045-02 - Occam-2 Reference Manual, ISBN: 0-13-629312-3, Prentice Hall (UK), 1988.
[16] US-Pat-4704678 - Function set for a microcomputer. Inmos Ltd., 26.Nov.1982.
[17] US-Pat-4724517 - Microcomputer with prefixing functions. Inmos Ltd., 26.Nov.1982.
[18] US-Pat-4758948 - Microcomputer. Inmos Ltd., 19.Jul.1988. https://patents.google.com
[19] US-Pat-4783734 - Computer with variable length process communication. Inmos Ltd., 08.Nov.1988.
[20] US-Pat-4794526 - Microcomputer with priority scheduling. Inmos Ltd., 27.Dec.1988.
[21] US-Pat-4989133 - System for executing time dependent processes. Inmos Ltd., 29.Jan.1991.
[22] 42-1426-07 - IMS-T425 Data-Sheet. SGS-Thomson Microelectronics, Feb.1996.
[23] 72-TRN-006-04 - Transputer Reference Manual. ISBN: 0-13-929001-X, Prentice Hall (UK), 1988.
[24] 72-TRN-011-00 - Transputer Development System. ISBN: 0-13-928995-X, Prentice Hall (UK), 1988.
[25] 72-TRN-119-05 - Transputer Instruction Set - A compiler writer's guide. ISBN: 0-13-929100-8, 1988.
[26] 72-TCH-033-00 - Analyzing transputer networks. Inmos Ltd., www.transputer.net
[27] 72-TCH-034-00 - Loading transputer networks. Inmos Ltd., www.transputer.net
[28] 72-TCH-039-00 - Long arithmetic on the Transputer. Inmos Ltd., www.transputer.net
[29] 72-TCH-048-03 - Transputer architecture. Inmos Ltd., Jul.87., www.transputer.net
[30] 72-TCH-061-00 - Support for debugging and breakpointing in Transputers. Inmos Ltd., Jan.1989.
[31] Guy Harriman, Transputer Instruction Set - Appendix. Inmos Ltd., 29.Nov.1988.
[32] Philip Mattos, Transputer Instruction Set - Effects of various instructions on C, D, E registers. Oct.1987.
[33] ST20C2 preliminary Instruction Set Reference Manual. ST Micro, ADCS 7301761C, 1997-2003.
[34] D.A.P. Mitchell, J.A. Thompson, G.A. Manson, G.R. Brookes, Inside the Transputer. ISBN: 0-632-01689-2, Blackwell Scientific Publications, 1990.
[35] John Roberts, Transputer Assembly Programming. ISBN: 0-442-00872-4, 1992.
[36] Ian Graham, Tim King, The Transputer Handbook. ISBN: 0-13-929134-2, Prentice Hall (UK), 1990.
[37] H. Reinecke, J. Schreiner, Transputer-Leitfaden: Eine Einführung und umfassende Beschreibung. ISBN: 3-446-16063-9, C. Hanser Verlag, München, 1991.
[38] Heinz Ebert, Transputer & Occam - Handbuch für Systementwickler. ISBN: 3-88229-000-5, Heise Verlag Hannover, 1993.
[39] LSC Transputer ToolSet. Logical Systems Inc. Corvallis, OR 97339 USA. 1986-1991.
[40] IMS Dx205 Occam 2 ToolSet (native compiler). Inmos Ltd., 1988.
[41] IMS Dx305 Occam 2 ToolSet (cross compiler). SGS-Thomson Microelectronics. 1993.
[42] The Kent Retargetable occam Compiler. Website: http://www.cs.kent.ac.uk/projects/ofa/kroc


[43] Helios - The Transputer OS. Recompiled by Axel Muhr: http://www.geekdot.com/helios-2
[44] Mark Homewood, David May, David Shepherd, and Roger Shepherd. The IMS T800 Transputer. In: IEEE Micro 7 (5), 1987, p.10-26.
[45] Guy Harriman, The IMS T810 - a preliminary Survey. Inmos Ltd., Jan.1989.
[46] David Shepherd, Verified Microcode Design. In: Journal of Microprocessors & Microsystems, Volume 14 Issue 10, Dec.1990, Pages 623-630.
[47] The T9000 Transputer, David May, Roger Shepherd and Peter Thompson, Inmos Ltd., IEEE 1992.
[48] N. Richardson, Lun Bin Huang, R. Hossain, J. Lewis, T. Zounes, Naresh Soni, The iCORE 520 MHz Synthesizable CPU Core. DAC 2002, IEEE Micro, 2003.
[49] Jonathan Goedl, Layout-Entwicklung für einen Open Source Mikrocontroller im 130nm Design-Flow. Bachelor Thesis submitted to the Duale Hochschule Baden-Württemberg, Stuttgart, 12.09.2016.
[50] Magdalena Baumann, Erstellung einer Verifikationsstrategie für einen Open Source Mikrocontroller. Praxisarbeit submitted to the Duale Hochschule Baden-Württemberg, Stuttgart, 13.03.2017.
[51] Sylvain Flieller, T.Node - an industrial version of SuperNode, Telmat Informatique, France, Computer Physics Communications 57, 1989, North Holland, p.492-494.
[52] Denis A. Nicole (University of Southampton, UK), Reconfigurable Transputer Processor Architectures. In: HICSS-22 Hawaii, Proceedings (vol. I) IEEE Computer Society Press, 1989, pp 365-374.
[53] D. Heidrich, J.C. Grossetie, Computing with T.Node Parallel Architecture, ISBN: 0-7923-1483-2, EUR-13975, Kluwer Academic Publishers, Dordrecht, The Netherlands, 1991.
[54] Anthony J. G. Hey, Supercomputing with transputers - past, present and future. In: Proceeding ICS '90 Proceedings of the 4th international conference on Supercomputing, 1991, Pages 479-489.
[55] M. Tanaka et al., A Design of Transputer Core and its Implementation in an FPGA. In: Proceedings of Communicating Process Architecture, 2004, Oxford, UK, p.361-372.
[56] K. Tanaka et al., The Design and Performance of SpaceWire Router-Network using CSP. In: Proceedings of 2nd SpaceWire Conference, 2008, Nara, Japan, p.141-144.
[57] K. Tanaka et al., Development of a Network on Chip for Parallel Processing Systems. From: Mathematics and Information Sciences, Tokyo Metropolitan University, Hachioji, Tokyo, Japan. 2009.
[58] David Keller, Andrés Amaya García, OpenTransputer: reinventing a parallel machine from the past. Master Thesis submitted to the University of Bristol, 25.Jun.2015.
[59] Andrés Amaya García, Open Transputer: Reinventing a parallel machine from the past. Presentation slides from CPA 2015.
[60] Andrés Amaya García, David Keller and David May, OpenTransputer: reinventing a parallel machine from the past. Paper submitted for CPA 2015 (pre-print).
[61] Digilent NEXYS-3, https://reference.digilentinc.com/reference/programmable-logic/nexys-3/start
[62] Digilent ATLYS, https://reference.digilentinc.com/reference/programmable-logic/atlys/start
[63] NEXYS-4-DDR, https://reference.digilentinc.com/reference/programmable-logic/nexys-4-ddr/start
[64] Digilent ARTY, https://reference.digilentinc.com/reference/programmable-logic/arty-a7/start
[65] Xilinx, DS160 Spartan-6 Family Overview v2.0 - 25.Oct.2011. At: http://www.xilinx.com
[66] Xilinx, DS180 Artix-7 Family Overview v2.6 - 27.Feb.2018. At: http://www.xilinx.com
[67] Wikipedia, Soft Microprocessors. https://en.wikipedia.org/wiki/Soft_microprocessor
[68] Xilinx, Ken Chapman, PicoBlaze - Creating Embedded Microcontrollers (PSM). May.2002.
[69] Xilinx, MicroBlaze Soft Core Processor: http://www.xilinx.com/tools/microblaze.htm
[70] Altera Processor Selector: https://www.altera.com/products/processors/overview.html
[71] LEON-3, Gaisler, https://www.gaisler.com/index.php/products/processors/leon3
[72] MIPSfpga, MIPS, https://www.mips.com/blog/mipsfpga
[73] RISC-V, SiFive, https://www.sifive.com/documentation
[74] XILINX ISE Design Suite v14.7_1015.1 - WebPack Edition - 23.Oct.2013 - www.xilinx.com
[75] XILINX Vivado Design Suite HL 2018.2 - WebPack Edition - 18.Jun.2018 - www.xilinx.com
[76] ALTERA Quartus lite v17.1.0.590 - Nov.2017 - www.altera.com
[77] Tristan Gingold, GHDL simulator v0.35 - 14.Dec.2017 - http://ghdl.free.fr
[78] GTKWave v3.3.87 - 2017 - http://gtkwave.sourceforge.net


Appendix A. List of T42 Micro-Operations (128bit Micro-Word) – Status: July 2018

ENTRYVALID; 127; 127; -- 1 bit
NEXTACTION; 126; 126; -- 1 bit
ERROR_MODE; 125; 125; -- 1 bit
ERROR_FROM; 122; 124; -- 3 bit
S_BIT_MODE; 120; 121; -- 2 bit
S_BIT_FROM; 115; 119; -- 5 bit
__GAP_02__; 114; 114; ++ 1
I_PTR_FROM; 112; 113; -- 2 bit
W_PTR_FROM; 111; 111; -- 1 bit
WPTR0_FROM; 110; 110; -- 1 bit
__GAP_03__; 107; 109; ++ 3
X_BUS_FROM; 104; 106; -- 3 bit
Y_BUS_FROM; 101; 103; -- 3 bit
__GAP_04__; 98; 100; ++ 3
O_REG_FROM; 96; 97; -- 2 bit
__GAP_05__; 95; 95; ++ 1
A_REG_FROM; 92; 94; -- 3 bit
A_SHIFT_IN; 89; 91; -- 3 bit
B_REG_FROM; 86; 88; -- 3 bit
B_SHIFT_IN; 84; 85; -- 2 bit
C_REG_FROM; 81; 83; -- 3 bit
C_SHIFT_IN; 79; 80; -- 2 bit
D_REG_FROM; 76; 78; -- 3 bit
COUNTER_IN; 74; 75; -- 2 bit
E_REG_FROM; 71; 73; -- 3 bit
E_SHIFT_IN; 69; 70; -- 2 bit
__GAP_06__; 64; 68; ++ 5
CONST_FROM; 59; 63; -- 5 bit
XYSIGN_EXT; 58; 58; -- 1 bit
Z_FROM_ALU; 52; 57; -- 6 bit
__GAP_07__; 46; 51; ++ 6
CARRY_FROM; 44; 45; -- 2 bit
CARRY_MODE; 43; 43; -- 1 bit
POINT_MODE; 42; 42; -- 1 bit
POINT_FROM; 40; 41; -- 2 bit
POINT_0_IS; 38; 39; -- 2 bit
__GAP_08__; 36; 37; ++ 2
CMUX1_FROM; 32; 35; -- 4 bit
CMUX0_FROM; 28; 31; -- 4 bit
__GAP_09__; 27; 27; ++ 1
MADDR_MODE; 25; 26; -- 2 bit
MDATA_MODE; 23; 24; -- 2 bit
MDATA_FROM; 20; 22; -- 3 bit
__GAP_10__; 19; 19; ++ 1
LNK_U_MODE; 18; 18; -- 1 bit
LNK_SELECT; 17; 17; -- 1 bit
LNK_PRI_IN; 16; 16; -- 1 bit
LNK_ACTION; 14; 15; -- 2 bit
TIMER_MODE; 12; 13; -- 2 bit
PRESC_MODE; 11; 11; -- 1 bit
__GAP_11__; 10; 10; ++ 1
ROMFEEDBAK; 0; 9; -- 10 bit
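As an illustration of how such a field list can be used, the following sketch packs and unpacks fields of a 128-bit micro-word held in an ordinary Python integer. It assumes that the two numbers of each entry are the lowest and highest bit position of the field (which agrees with the stated bit widths). Only four fields are copied here, and the helper names `get_field` and `set_field` are hypothetical, not part of the T42 sources.

```python
# (low bit, high bit) of a few fields, copied from the list above.
FIELDS = {
    "ENTRYVALID": (127, 127),
    "ERROR_FROM": (122, 124),
    "Z_FROM_ALU": (52, 57),
    "ROMFEEDBAK": (0, 9),
}

# Widths of the unused __GAP_02__ .. __GAP_11__ ranges, in listed order.
GAP_WIDTHS = [1, 3, 3, 1, 5, 6, 2, 1, 1, 1]

def get_field(word: int, name: str) -> int:
    """Extract one field from a 128-bit micro-word."""
    low, high = FIELDS[name]
    mask = (1 << (high - low + 1)) - 1
    return (word >> low) & mask

def set_field(word: int, name: str, value: int) -> int:
    """Return a copy of the micro-word with one field replaced."""
    low, high = FIELDS[name]
    mask = (1 << (high - low + 1)) - 1
    if not 0 <= value <= mask:
        raise ValueError(f"{value} does not fit into {name}")
    return (word & ~(mask << low)) | (value << low)

word = set_field(0, "ROMFEEDBAK", 0x2A5)
word = set_field(word, "ERROR_FROM", 5)
print(hex(word), get_field(word, "ROMFEEDBAK"), 128 - sum(GAP_WIDTHS))
```

The last printed value shows that 104 of the 128 bits currently carry micro-operations; the remaining 24 bits are gaps.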

Note: T42 micro-operations are still subject to change and may be optimised with the next design version.


Appendix B. List of T42 Instructions and Execution Clocks – Status: July 2018

Table B.1. Primary Function Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 0X | j | 3 | 3 | jump | D |
| 1X | ldlp | 1 | 1 | load local pointer | |
| 2X | pfix | 1 | 1 | prefix | |
| 3X | ldnl | 2 | 2 | load non-local | |
| 4X | ldc | 1 | 1 | load constant | (TVS-1) |
| 5X | ldnlp | 1 | 1 | load non-local pointer | |
| 6X | nfix | 1 | 1 | negative prefix | |
| 7X | ldl | 2 | 2 | load local | |
| 8X | adc | 1 | 2 | add constant | E, (TVS-1) |
| 9X | call | 7 | 6 | call | |
| AX | cj | 2 or 4 | 2 or 3 | conditional jump (not taken / taken) | |
| BX | ajw | 1 | 1 | adjust workspace | |
| CX | eqc | 2 | 2 | equals constant | (TVS-1) |
| DX | stl | 1 | 1 | store local | |
| EX | stnl | 2 | 2 | store non-local | |
| FX | opr | - | - | operate | |

D - the instruction is a descheduling point

E - the instruction will affect the Error flag
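The memory codes in the tables of this appendix are byte sequences in which every byte carries a 4-bit function code and a 4-bit data value; `pfix` (2X) and `nfix` (6X) build up larger operands in the operand register. The sketch below decodes such a hexadecimal string according to the standard Transputer prefixing rule. The function name `decode` and the tuple representation are illustrative choices only.

```python
def decode(memory_code: str):
    """Decode a hex byte string such as '24F6' into (function, operand) pairs."""
    oreg = 0          # operand register, accumulated across prefix bytes
    result = []
    for byte in bytes.fromhex(memory_code):
        function, data = byte >> 4, byte & 0xF
        oreg |= data
        if function == 0x2:      # pfix: shift operand up by one nibble
            oreg <<= 4
        elif function == 0x6:    # nfix: complement, then shift up
            oreg = (~oreg) << 4
        else:                    # any other function consumes the operand
            result.append((function, oreg))
            oreg = 0
    return result

# 'and' (24F6) is opr 0x46, 'lddevid' (2127FC) is opr 0x17C.
print(decode("24F6"), decode("2127FC"), decode("604F"))
```

In this representation function code 15 is `opr`, so the operand is the secondary operation number; `604F` decodes to `ldc` (function 4) with operand -1.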

Table B.2. Arithmetic/Logical Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 24F6 | and | 1 | 1 | and | (TVS-1) |
| 24FB | or | 1 | 1 | or | (TVS-1) |
| 23F3 | xor | 1 | 1 | exclusive or | (TVS-1) |
| 23F2 | not | 1 | 1 | bitwise not | (TVS-1) |
| 24F1 | shl | n+2 | n+1 | shift left | (TVS-1) |
| 24F0 | shr | n+2 | n+1 | shift right | (TVS-1) |
| F5 | add | 1 | 2 | add | E, (TVS-1) |
| FC | sub | 1 | 2 | subtract | E, (TVS-1) |
| 25F3 | mul | 38 | 35 | multiply | E, (TVS-1) |
| 27F2 | fmul | 35 | 37 | fractional multiply (no rounding) | E, (TVS-1) |
| | | 40 | 39 | fractional multiply (rounding) | E, (TVS-1) |
| 22FC | div | 39 | 40 | divide | E, (TVS-1) |
| 21FF | rem | 37 | 41 | remainder | E, (TVS-1) |
| F9 | gt | 2 | 1 | greater than | (TVS-1) |
| F4 | diff | 1 | 1 | difference | (TVS-1) |
| 25F2 | sum | 1 | 1 | sum | (TVS-1) |
| F8 | prod | b+4 | b+3 | product for positive register A | (TVS-1) |
| F8 | prod | m+5 | b+3 | product for negative register A | (TVS-1) |

n - the number of (bit-)places shifted

b - bit number of the highest bit set in register A. Bit 0 is the least significant bit

m - bit number of the highest bit set in the absolute value of register A. Bit 0 is the least significant bit
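The data-dependent cycle counts can be evaluated directly from the operand values. The sketch below covers only the `shl` row and the `prod` row for a positive register A, using the formulas from Table B.2; the function names are illustrative.

```python
def highest_bit(value: int) -> int:
    """Bit number of the highest bit set (bit 0 is the least significant)."""
    if value <= 0:
        raise ValueError("only defined here for positive values")
    return value.bit_length() - 1

def shl_cycles(n: int):
    """Return (IMS-T425, T42) cycles for a shift left by n places."""
    return n + 2, n + 1

def prod_cycles(a: int):
    """Return (IMS-T425, T42) cycles for prod with positive register A."""
    b = highest_bit(a)
    return b + 4, b + 3

print(shl_cycles(4), prod_cycles(1), prod_cycles(0x100))
```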


Table B.3. Long Arithmetic Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 21F6 | ladd | 2 | 2 | long add | E, (TVS-1) |
| 23F8 | lsub | 2 | 2 | long subtract | E, (TVS-1) |
| 23F7 | lsum | 3 | 1 | long sum | (TVS-1) |
| 24FF | ldiff | 3 | 1 | long diff | (TVS-1) |
| 23F1 | lmul | 33 | 33 | long multiply | (TVS-1) |
| 21FA | ldiv | 35 | 36 | long divide | E, (TVS-1) |
| 23F6 | lshl | n+3 | n+4 | long shift left (n<32) | (TVS-1) |
| | | n-28 | n-28 | long shift left (n=32) | (TVS-1) |
| 23F5 | lshr | n+3 | n+4 | long shift right (n<32) | (TVS-1) |
| | | n-28 | n-28 | long shift right (n=32) | (TVS-1) |
| 21F9 | norm | n+5 | n+3 | normalise (n<32) | (TVS-1) |
| | | n-26 | n-28 | normalise (n=32) | (TVS-1) |
| | | 3 | 3 | normalise (n=64) | (TVS-1) |

Table B.4. CRC and Bit Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 27F4 | crcword | 35 | 35 | calculate crc on word | (TVS-1) |
| 27F5 | crcbyte | 11 | 11 | calculate crc on byte | (TVS-1) |
| 27F6 | bitcnt | b+2 | b+2 | count bits set in word | (TVS-1) |
| 27F7 | bitrevword | 36 | 35 | reverse bits in word | (TVS-1) |
| 27F8 | bitrevnbits | n+4 | n+2 | reverse bottom n bits in word | (TVS-1) |

Table B.5. General Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| F0 | rev | 1 | 1 | reverse | (TVS-1) |
| 23FA | xword | 4 | 3 | extend to word | (TVS-1) |
| 25F6 | cword | 5 | 4 | check word | E, (TVS-1) |
| 21FD | xdble | 2 | 2 | extend to double | (TVS-1) |
| 24FC | csngl | 3 | 3 | check single | E, (TVS-1) |
| 24F2 | mint | 1 | 1 | minimum integer | (TVS-1) |
| 25FA | dup | 1 | 1 | duplicate top of stack | (TVS-1) |
| 27F9 | pop | 1 | 1 | pop processor stack | (TVS-1) |

Table B.6. Indexing/Array Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| F2 | bsub | 1 | 1 | byte subscript | (TVS-1) |
| FA | wsub | 2 | 1 | word subscript | (TVS-1) |
| 28F1 | wsubdb | 3 | 2 | form double word subscript | (TVS-1) |
| 23F4 | bcnt | 2 | 1 | byte count | (TVS-1) |
| 23FF | wcnt | 5 | 3 | word count | (TVS-1) |
| F1 | lb | 5 | 2 | load byte | |
| 23FB | sb | 4 | 2 | store byte | |
| 24FA | move | 2w+8 | 2w+7 | move message | |

w - the minimum number of words including part words, required to move the byte count of a message
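For a quick estimate of `move` timing, w can be derived from the byte count. The sketch below assumes 32-bit words and a word-aligned message, so that w is simply the byte count divided by four and rounded up; unaligned messages may touch one more word. The function names are illustrative.

```python
def words_needed(byte_count: int, bytes_per_word: int = 4) -> int:
    """Number of words, including part words, for an aligned message."""
    return -(-byte_count // bytes_per_word)  # ceiling division

def move_cycles(byte_count: int):
    """Return (IMS-T425, T42) cycles for move, using 2w+8 and 2w+7."""
    w = words_needed(byte_count)
    return 2 * w + 8, 2 * w + 7

print(move_cycles(10), move_cycles(64))
```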


Table B.7. 2D Block Move Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 25FB | move2dinit | 8 | 4 | initialise data for 2D block move | |
| 25FC | move2dall | (2p+23)*r | (2p+24)*r | 2D block copy | |
| 25FD | move2dnonzero | (2p+23)*r | (2p+24)*r | 2D block copy non-zero bytes | |
| 25FE | move2dzero | (2p+23)*r | (2p+25)*r | 2D block copy zero bytes | |

p - the number of words per row

r - the number of rows
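The T42 needs fewer cycles for `move2dinit` but one more cycle per row for `move2dall`, so the net effect depends on the number of rows. The following sketch adds both contributions, assuming that one `move2dinit` is executed for each `move2dall`; the function name is hypothetical.

```python
def block_copy_2d_cycles(p: int, r: int):
    """Return (IMS-T425, T42) cycles for move2dinit followed by move2dall.

    p: number of words per row, r: number of rows (formulas from Table B.7).
    """
    t425 = 8 + (2 * p + 23) * r
    t42 = 4 + (2 * p + 24) * r
    return t425, t42

for rows in (1, 4, 10):
    print(rows, block_copy_2d_cycles(4, rows))
```

Under this assumption both processors need the same number of cycles at r = 4; the T42 is ahead for fewer rows and behind for more.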

Table B.8. Control Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 22F0 | ret | 5 | 3 | return | |
| 21FB | ldpi | 2 | 1 | load pointer to instruction | |
| 23FC | gajw | 2 | 2 | general adjust workspace | |
| F6 | gcall | 4 | 3 | general call | |
| 22F1 | lend | 10 | 7 | loop end (loop) | D |
| | | 5 | 5 | loop end (exit) | D |

Table B.9. Error Handling Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 21F3 | csub0 | 2 | 2 | check subscript from 0 | E, (TVS-1) |
| 24FD | ccnt1 | 3 | 3 | check count from 1 | E, (TVS-1) |
| 22F9 | testerr | 2 | 1 | test error false and clear (no error) | |
| | | 3 | 2 | test error false and clear (error) | |
| 21F0 | seterr | 1 | 1 | set error | E |
| 25F5 | stoperr | 2 | 2 | stop on error (no error) | D |
| 25F7 | clrhalterr | 1 | 1 | clear halt-on-error | |
| 25F8 | sethalterr | 1 | 1 | set halt-on-error | |
| 25F9 | testhalterr | 2 | 1 | test halt-on-error | |

Table B.10. Processor Initialisation Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 22FA | testpranal | 2 | t.b.d. | test processor analyzing | |
| 23FE | saveh | 4 | 4 | save high priority queue registers | |
| 23FD | savel | 4 | 4 | save low priority queue registers | |
| 21F8 | sthf | 1 | 1 | store high priority front pointer | |
| 25F0 | sthb | 1 | 1 | store high priority back pointer | |
| 21FC | stlf | 1 | 1 | store low priority front pointer | |
| 21F7 | stlb | 1 | 1 | store low priority back pointer | |
| 25F4 | sttimer | 1 | 1 | store timer | |
| 2127FC | lddevid | 1 | 1 | load device identity | |
| 27FE | ldmemstartval | 1 | 1 | load value of memstart address | |


Table B.11. Scheduling Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| FD | startp | 12 | t.b.d. | start process | |
| F3 | endp | 13 | t.b.d. | end process | D |
| 23F9 | runp | 10 | t.b.d. | run process | |
| 21F5 | stopp | 11 | t.b.d. | stop process | |
| 21FE | ldpri | 1 | 1 | load current priority | |

Table B.12. Input/Output Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| F7 | in | 2w+19 | 2w+15 | input message | D |
| FB | out | 2w+19 | 2w+18 | output message | D |
| FF | outword | 23 | 24 | output word | D |
| FE | outbyte | 23 | 23 | output byte | D |
| 24F3 | alt | 2 | 2 | alt start | (TVS-1) |
| 24F4 | altwt | 5 | 5 | alt wait (channel ready) | D |
| | | 17 | t.b.d. | alt wait (channel not ready) | D |
| 24F5 | altend | 4 | 4 | alt end | |
| 24F9 | enbs | 3 | 3 | enable skip | |
| 23F0 | diss | 4 | 5 | disable skip | |
| 21F2 | resetch | 3 | 3 | reset channel | |
| 24F8 | enbc | 7 | 6 | enable channel (ready) | |
| | | 5 | 5 | enable channel (not ready) | |
| 22FF | disc | 8 | 7 | disable channel | |

Table B.13. Timer Handling Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 22F2 | ldtimer | 2 | 1 | load timer | |
| 22FB | tin | 30 | t.b.d. | timer input (time future) | D |
| | | 4 | t.b.d. | timer input (time past) | D |
| 24FE | talt | 4 | 3 | timer alt start | (TVS-1) |
| 25F1 | taltwt | 15 | t.b.d. | timer alt wait (time past) | D |
| | | 48 | t.b.d. | timer alt wait (time future) | D |
| 24F7 | enbt | 8 | 7 | enable timer | |
| 22FE | dist | 23 | t.b.d. | disable timer | |

Table B.14. Floating Point Support Operation Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 27F3 | cflerr | 3 | 3 | check floating point error | E, (TVS-1) |
| 29FC | fptesterr | 1 | 1 | load value true (FPU not present) | |
| 26F3 | unpacksn | 15 | 11 | unpack single length fp number | (TVS-1) |
| 26FD | roundsn | 12/15 | 13 | round single length fp number | (TVS-1) |
| 26FC | postnormsn | 5/30 | 6/28 | post-normalise correction of single length floating point number | (TVS-1) |
| 27F1 | ldinf | 1 | 1 | load single length infinity | (TVS-1) |

Processor cycles are shown as typical/maximum cycles.


Table B.15. Debugger Support Codes

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 00 | jump0 | 3 | 3 | jump0 (break not enabled) | D |
| | | 11 | 9 | jump0 (break enabled, high priority) | |
| | | 13 | 9 | jump0 (break enabled, low priority) | |
| 2BF1 | break | 9 | 7 | break (high priority) | |
| | | 11 | 7 | break (low priority) | |
| 2BF2 | clrj0break | 1 | 1 | clear jump 0 break enable flag | |
| 2BF3 | setj0break | 1 | 1 | set jump 0 break enable flag | |
| 2BF4 | testj0break | 2 | 1 | test jump 0 break enable flag set | |
| 27FA | timerdisableh | 1 | 1 | disable high priority timer interrupt | |
| 27FB | timerdisablel | 1 | 1 | disable low priority timer interrupt | |
| 27FC | timerenableh | 6 | 1 | enable high priority timer interrupt | |
| 27FD | timerenablel | 6 | 1 | enable low priority timer interrupt | |

Table B.16. Undocumented Processor Test Instructions

| Memory Code | Mnemonic | Processor Cycles IMS-T425 | Processor Cycles T42 core | Name | Definition (Remark) |
|---|---|---|---|---|---|
| 212FFF | start | ? | t.b.d. | start processor (RESET) | |
| 22F3 | testlds | 1 | 1 | load status-register | |
| 22F4 | testlde | 1 | 1 | load E register | |
| 22F5 | testldd | 1 | 1 | load D register | |
| 22F6 | teststs | 1 | 1 | store status-register | |
| 22F7 | testste | 1 | 1 | store E register | |
| 22F8 | teststd | 1 | 1 | store D register | |
| 22FD | testhardchan | 1 | 1 | test hard channel stack | |

Remark: The T42-in-FPGA instruction execution is optimised for pipelined read access to internal SRAM memory with 2 clock cycles, which is reflected in the processor cycle values above. All read accesses to external memory are made via the cache, which requires a minimum of 3 clock cycles and adds one processor stall cycle to each successful cache read hit. In case of a cache read miss, the number of clock cycles until data arrives depends on the external memory access time.

Note: T42 processor cycles per instruction are still subject to change and microcode may be optimised with the next design version.
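The remark on memory timing can be turned into a simple estimate of instruction cycles for code that reads from external memory. This is a minimal sketch: the one stall cycle per cache read hit comes from the remark, whereas `miss_stall` is a free parameter that has to be chosen for the external memory actually used. The function name is hypothetical.

```python
def effective_cycles(base_cycles: int, cache_hits: int = 0,
                     cache_misses: int = 0, miss_stall: int = 0) -> int:
    """Estimate clock cycles for one instruction with external reads.

    base_cycles : value from the Appendix B tables (internal SRAM access)
    cache_hits  : number of reads served by the cache (one stall cycle each)
    cache_misses: number of reads that miss the cache
    miss_stall  : assumed stall cycles per miss (depends on external memory)
    """
    return base_cycles + cache_hits * 1 + cache_misses * miss_stall

# ldl takes 2 cycles from internal SRAM; with one cached external read: 3.
print(effective_cycles(2), effective_cycles(2, cache_hits=1))
```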
