ASYNCHRONOUS MICROENGINES FOR NETWORK
PROCESSING
by
Niti Madan
A thesis submitted to the faculty of The University of Utah
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Science
School of Computing
The University of Utah
May 2006
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
SUPERVISORY COMMITTEE APPROVAL
of a thesis submitted by
Niti Madan
This thesis has been read by each member of the following supervisory committee and by majority vote has been found to be satisfactory.
Chair: Erik Brunvand
Ganesh Gopalakrishnan
Al Davis
THE UNIVERSITY OF UTAH GRADUATE SCHOOL
FINAL READING APPROVAL
To the Graduate Council of the University of Utah:
I have read the thesis of Niti Madan in its final form and have found that (1) its format, citations, and bibliographic style are consistent and acceptable; (2) its illustrative materials including figures, tables, and charts are in place; and (3) the final manuscript is satisfactory to the Supervisory Committee and is ready for submission to The Graduate School.
Date
Erik Brunvand, Chair, Supervisory Committee
Approved for the Major Department
Christopher R. Johnson, Chair/Director
Approved for the Graduate Council
David S. Chapman, Dean of The Graduate School
ABSTRACT
We present a network processor architecture that is based on asynchronous microcoded
controller hardware (a.k.a. an asynchronous microengine). The focus of this work is not
on the processor architecture, but rather on the asynchronous microcoded style used to
build such an architecture. This circuit style tries to fill the performance gap between a
specialized ASIC (Application-Specific Integrated Circuit) and a more general network
processor implementation. It does this by providing a microcoded framework that is
close in performance to ASICs and is also programmable at the finer granularity of
microcode. Our approach exploits the inherent advantages of asynchronous design techniques: modularity, average-case completion time, lower power consumption, and low electromagnetic interference. We have evaluated our circuit style by demonstrating fast-path IP (Internet Protocol) routing as the packet processing application. The
flexibility aspect of this design has been demonstrated by adding firewalling functionality
to the router by modifying the microcode. Each microengine core is specialized enough
for different packet processing kernels yet generic enough to handle newer protocols
and applications. To shorten the design cycle, we have implemented our design on a Xilinx Spartan-II FPGA board; however, we also extrapolate our results to a best-guess ASIC implementation.
CONTENTS
ABSTRACT . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . iv
LIST OF FIGURES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . viii
LIST OF TABLES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . x
ACKNOWLEDGMENTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii
CHAPTERS
1. INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2. BACKGROUND . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.1 Asynchronous Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Advantages of Asynchronous Design . . . . . . . . . . . . . . . . . . . . . . . . 5
2.3 Asynchronous Microengine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3.1 Architecture Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.2 Microengine Execution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.3 Next Address Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.4 Microcode Fields . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3.5 Control Units . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.4 Internet Protocol (IP) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 IP Header . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.4.2 IP Over Ethernet Layer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.4.3 IP Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.5 Click Modular Router . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.6.1 Programmable Asynchronous Controllers . . . . . . . . . . . . . . . . . . . 20
2.6.2 Network Processors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3. DESIGN AND IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1 Router Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.1.1 Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2 Ingress Microengine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
3.2.1 Datapaths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 Microprogram Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3 Operation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Header-processing Microengine . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Datapaths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Microprogram Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.3.3 Operation Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Design Methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4.1 Macro-modular Design Approach . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 VHDL-based Design Approach . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.3 Memory Design Using Xilinx Core Generator . . . . . . . . . . . . . . . . 33
3.4.4 Bundled Delay vs. Completion Detection . . . . . . . . . . . . . . . . . . . 33
3.5 FPGA Resource Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
4. EVALUATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1 Evaluation of Async Advantages . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.1 Demonstration of Flexibility . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.1.2 Demonstration of Average-case Completion Times . . . . . . . . . . . . . . 37
4.1.2.1 Synchronous Version of Microengine . . . . . . . . . . . . . . . . . . . 37
4.1.3 Power Consumption Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Improving Asynchronous Microengines . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Performance Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3.1 Comparison with Click Software Router . . . . . . . . . . . . . . . . . . . 43
4.3.1.1 Extrapolation of FPGA Results to an ASIC Version . . . . . . . . . . . 44
4.3.2 Comparison with Intel’s IXP1200 Network Processor . . . . . . . . . . . . 45
4.4 Throughput Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
5. CONCLUSIONS AND FUTURE WORK . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
APPENDICES
A. MICROCODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
B. SOURCE CODE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
REFERENCES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
LIST OF FIGURES
2.1 Microengine’s high level structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Microinstruction format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Execution control unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Next address unit, adapted from [14] . . . . . . . . . . . . . . . . . . . . . . . . 11
2.5 Branch detect unit . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.6 Local RAS block . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.7 IP header format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.8 Ethernet frame format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
2.9 Click modular IP router configuration, adapted from [18] . . . . . . . . . . . . 17
3.1 High level architecture of the microengine-based router . . . . . . . . . . . . . . 23
3.2 Block diagram of ingress microengine . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.3 Send-to-fifo datapath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.4 Block diagram of IP header processing microengine . . . . . . . . . . . . . . . . . 29
3.5 Alt-ring conditional construct . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.1 Datapath execution in the 6th microinstruction of IP header processing microengine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2 Microinstruction execution times of ingress microengine . . . . . . . . . . . . . 38
4.3 Microinstruction execution times of IP header processing microengine . . . 39
B.1 Source code of a 2-input c-element . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
B.2 Source code of IP-header processing microengine’s BDU . . . . . . . . . . . . . 60
B.3 Source code of next-address logic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
B.4 Source code of next-address logic contd. . . . . . . . . . . . . . . . . . . . . . . . . . 62
B.5 Source code of IP-header processing microengine’s microinstruction register 63
B.6 Source code of sel-addr toggle module . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
B.7 Source code of the address register . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
B.8 Source code of the 8-bit ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
B.9 Source code of the 16-bit ALU . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
B.10 Source code of the 16-bit comparator . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
B.11 Source code of the 16-bit comparator contd . . . . . . . . . . . . . . . . . . . . . . . 69
B.12 Source code of the RAS block with set-exe . . . . . . . . . . . . . . . . . . . . . . . . 70
B.13 Source code of the RAS block with set-exe and set-seq . . . . . . . . . . . . . . . 71
B.14 Source code of the RAS block with set-seq . . . . . . . . . . . . . . . . . . . . . . . . 72
B.15 Source code of the RAS block with multiple set-seq bits and set-exe . . . . . 73
LIST OF TABLES
4.1 Microinstruction execution times of an ingress microengine . . . . . . . . . . . 37
4.2 Microinstruction execution times for IP header processing microengine . . 38
4.3 Ingress microengine per packet execution time . . . . . . . . . . . . . . . . . . . . . 40
4.4 IP header processing microengine per packet execution time . . . . . . . . . . . 40
4.5 The number of datapaths executing in the IP header processing microengine 41
4.6 The number of datapaths executing in the ingress microengine . . . . . . . . . 42
4.7 Click element’s per packet execution time . . . . . . . . . . . . . . . . . . . . . . . . 43
A.1 Address register’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
A.2 8-bit ALU’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.3 Offset register’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.4 Packet memory’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A.5 Header register’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.6 16-bit comparatorA’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.7 16-bit comparatorB’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
A.8 Flow-id register’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.9 Send-to-fifo’s microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.10 Global control microcode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.11 Microcode for IP microengine’s address and paint registers . . . . . . . . . . . 52
A.12 Microcode of IP microengine’s 8-bit ALU . . . . . . . . . . . . . . . . . . . . . . . . 53
A.13 Microcode of IP microengine’s packet store . . . . . . . . . . . . . . . . . . . . . . . 53
A.14 Microcode of IP microengine’s header register file . . . . . . . . . . . . . . . . . . 54
A.15 Microcode of IP microengine’s header register file contd. . . . . . . . . . . . . . 55
A.16 Microcode of IP microengine’s 16-bit ALU . . . . . . . . . . . . . . . . . . . . . . . 55
A.17 Microcode of IP microengine’s 16-bit comparator . . . . . . . . . . . . . . . . . . . 56
A.18 Microcode of IP microengine’s 16-bit temporary registers . . . . . . . . . . . . 56
A.19 Microcode of IP microengine’s CAMs and send-to-fifo datapaths . . . . . . . 57
A.20 Microcode of IP microengine’s flow-id register . . . . . . . . . . . . . . . . . . . . . 58
A.21 Microcode of IP microengine’s global control . . . . . . . . . . . . . . . . . . . . . . 58
ACKNOWLEDGMENTS
The three years that I have spent as a master's student have been one of the most
rewarding experiences of my life. I not only furthered my technical skills but also got
an opportunity to enrich my personal life. I have realized the importance of patience,
intellectual humility and optimism during this course of time. My master's at Utah has
influenced me to pursue a doctorate in the department and has equipped me better for
future challenges.
I would like to sincerely thank my advisor Erik Brunvand for guiding me towards the
completion of this thesis. He has been very encouraging and patient with me besides
giving me a lot of independence. I am grateful to him for giving me opportunities
to attend many conferences. I would like to thank my committee members, Ganesh
Gopalakrishnan and Al Davis for their constructive and timely feedback on my proposal
and thesis drafts that has helped strengthen my thesis evaluation.
I am grateful to my colleagues: Vamshi, Himanshu and Gaurav for insightful techni-
cal discussions and help with CAD tool troubleshooting. Special thanks to my graduate
advisor Karen Feinauer for readily helping out with all the paper work and last minute
requests.
Thanks to my parents for always having faith in my abilities and being my role
models for pursuing a career in research and teaching. Thanks to all my friends (too numerous to name here) for their friendship, support, and being my family during my stay
in Utah. Finally, special thanks to my life partner Shashi for all his love, wisdom and
encouragement that has helped me cope with difficult times.
CHAPTER 1
INTRODUCTION
Programmable controllers have gained widespread popularity in varied processing fields because they give designers flexibility, the ability to correct errors in later stages of the design cycle, easy upgrading of product families, and faster time to market, all without compromising much on performance. There are many examples
of synchronous programmable controllers like the ones in the FLASH processor[19],
S3MP processor[1] and many commercial ASICs (Application-Specific Integrated Cir-
cuit). Although traditional programmable controllers are synchronous, there is no reason
that such controllers could not also be designed in an asynchronous style. Asynchronous
implementation could add extra features that designers could exploit to improve their
systems. For example, asynchronous programmable control in the form of a micro-
programmed asynchronous controller (also known as asynchronous microengine[13, 14,
12]) provides modular and easily extensible datapath structures along with high perfor-
mance by exploiting concurrency between operations and employing efficient control
structures. These microengines are programmable at the fine granularity of microcode
and are close in performance to ASICs. Other than the modularity and flexibility ad-
vantages, microengines help exploit other aspects of asynchronous design style like low
power consumption and low EMI. Asynchronous microengines have been around for
quite some time but haven’t been actively pursued in realistic domains. The focus of this
thesis is to evaluate an asynchronous microengine based architecture in the domain of
network processing.
Traditionally, most networking functions above the physical layer have been im-
plemented by software running on general-purpose processors or there have been spe-
cialized ASICs for these tasks. ASICs provide an expensive solution and give a good
2
performance but lack flexibility and programmability whereas general-purpose proces-
sors provide programmability at the cost of performance. The bandwidth explosion in
the past couple of years has resulted in more bandwidth-hungry and computationally
intensive applications like VoIP (Voice over Internet Protocol), streaming audio and
video, P2P(Peer to Peer) applications etc. For networks to effectively handle these new
applications, they will need to support new protocols along with high throughput and
hence the need for flexible and modular network processing equipment. Initially, the
layer 2 and layer 3 processing was hard-wired but, after rapid changes in lower layer
protocols and higher layer’s applications, a more scalable solution in the form of network
processors[5, 26] has emerged. These network processing units have been optimized
for networking applications and combine the advantages of ASIC and general purpose
processors. There has been a wide range of programmable architectures proposed for
network processing. This thesis proposes an asynchronous-microengine-based circuit style as another such solution, one that fits naturally into the domain of network processing. It does this by exploiting the asynchrony in network packet flow
along with a highly flexible, modular and extensible architecture. Although the present
network processor design approach is to balance programmability and performance, the
existing solutions are still limited by memory bottleneck issues and do not address the power consumption problem. This thesis does not deal with the memory bottleneck
but helps alleviate power consumption by providing the equivalent of clock gating at a
fine-grain level because of the asynchronous behavior of the circuits.
In this work, a case for asynchronous microengines in the domain of network pro-
cessing is presented. This circuit style has been evaluated by demonstrating fast-path
IP (Internet Protocol) routing as the packet processing application. Two types of mi-
croengines have been designed, namely ingress and IP-header processing microengines.
The ingress microengine does packet classification while the IP-header processing micro-
engine performs various functions on the IP header like TTL (time to live) decrement,
checksum check and computation, etc. Each microengine core is specialized enough
for different packet processing kernels yet generic enough to handle newer protocols
and applications. The flexibility aspect of this design has been demonstrated by adding
firewalling functionality to the router by modifying the microcode.
These microengine cores can be used to replace RISC cores or can be used as co-
processors. The asynchronous design style makes it possible to have a large number of
these cores on a single chip without having to worry about clock interface issues. An
asynchronous core can also be easily integrated with a synchronous system using FIFO
interfaces.
To shorten the design cycle time, this design has been implemented on a Xilinx
xs2S150 SpartanII FPGA [32] board. Since a microprogrammable architecture’s real
benefit is to help designers re-program the ASIC by changing the microcode, an FPGA
implementation of a microprogrammable design has no advantage as it is already recon-
figurable. Thus, for most of the evaluation of this design approach, an extrapolated ASIC
version has been used. This architecture’s per-packet computation performance has been
compared with the Click software router[18].
This thesis
• Demonstrates that asynchronous microengines are well suited to network processing
• Implements a modular IP router that does minimal data-plane processing and prototypes it on a Xilinx FPGA board
• Evaluates its performance by measuring the execution times for various packet types and comparing it to a similar Click Router configuration
• Extrapolates the performance to a best-guess ASIC
CHAPTER 2
BACKGROUND
In this chapter, the asynchronous design approach [6, 21] is briefly introduced and the motivation behind using it is discussed, along with the asynchronous microengine architecture, routing concepts, and the Click software router [18].
2.1 Asynchronous Design
Asynchronous, or self-timed, systems are those that are not subject to a global syn-
chronizing clock. Since asynchronous circuits have no global clock, they must use some
other means to enforce sequencing of circuit activities. These sequencing techniques
can range from using a locally generated, clock-like signal in each submodule to em-
ploying handshake signals both to initiate (request) and detect completion of submodule
activities (acknowledge), and many variations in between. Circuits that use handshakes
to sequence operations are known as self-timed circuits [25] and allow control to be
distributed throughout the circuit instead of centralized in a controller. For asynchronous
control, the two dominant handshaking protocols are:
two-phase (transition) signaling, in which each transition on the REQ or ACK signal represents an event.
four-phase (level) signaling, in which only a positive-going transition on the REQ or ACK signal initiates an event, and each signal must be “returned to zero” before the handshake cycle is completed.
A self-timed system thus consists of self-timed modules that communicate with each other, in parallel or in sequence, using these handshake protocols. In this implementation, the four-phase handshaking protocol has been used, as it is very similar to the synchronous design style, so that the synchronous memory modules available in the Xilinx library can be utilized easily.
2.2 Advantages of Asynchronous Design
In the network processing domain, asynchronous systems have a number of compelling advantages over their synchronous counterparts, including:
Timing Self-timed circuits separate timing from functionality. Rather than synchronize
the entire system to a global clock signal, self-timed circuits localize timing in-
formation to individual circuits and avoid the problems of clock distribution and
clock skew endemic in synchronous designs.
Composability Systems may be constructed by connecting components and assembling
subsystems based only on their functionality rather than having to consider their
timing characteristics. Each subsystem can be designed and tested independently
with confidence that, because a self-timed communication protocol is used, they
will operate correctly when assembled into a larger system. This enables us to use
network processing microengines along with other processors on an SOC (system on a chip).
Lower Power Dissipation As clock speed and system size increase, the portion of the
system’s power budget dedicated to distributing the clock increases dramatically.
Self-timed systems do not incur the power overhead of distributing a free running
clock across the entire system. Because self-timed systems make signal transitions
only when actually doing work or communicating, large systems can show greatly
decreased power dissipation in some technologies (like CMOS), especially during
quiescence. Due to the reactive nature of network packets, we get a natural power-down during slack times.
Lower Electromagnetic Interference (EMI) A free running clock signal charges and
discharges large capacitances (the clock tree is typically the largest and most capac-
itive net on a chip) at regular intervals. This is a source of a great deal of EMI and
can have serious consequences, especially for circuits that have RF components.
Self-timed systems may make just as many transitions as a clocked system, but
those transitions are uncorrelated in time so they do not lead to the bad EMI
properties of synchronous circuits. Also, local handshake wires tend to have lower
capacitance than a global clock line. This can be an advantage as the network
routers can be in proximity to other RF type devices.
Incremental Improvement In a properly designed asynchronous system it is possible
to improve the performance or functionality of a system by improving or replac-
ing individual subsystems incrementally without changing or re-timing the whole
system. This can be an advantage by allowing plug-in modules for newer versions
of protocols.
Performance Traditional synchronous systems usually exhibit worst case behavior while
asynchronous systems tend to reflect the average case. This difference can result in
large performance increases for some systems. Since most network packets fall into the average case, we can expect higher performance.
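The performance point can be made concrete with a back-of-the-envelope calculation. The delay figures and packet-mix fraction below are invented purely for illustration, not measurements from this thesis:

```python
# Hypothetical per-operation delays (nanoseconds); the values and the
# fast-path probability are assumptions made up for this sketch.
delays_ns = {"common": 4.0, "rare": 10.0}
p_common = 0.95  # assumed fraction of packets taking the fast (common) path

# A synchronous design must clock at the worst-case delay of any operation.
sync_cycle = max(delays_ns.values())  # always pays the worst-case 10.0 ns

# A self-timed design completes each operation in its own time, so its
# effective cycle time is the probability-weighted average delay.
async_avg = p_common * delays_ns["common"] + (1 - p_common) * delays_ns["rare"]

speedup = sync_cycle / async_avg  # gain comes entirely from the common case
```

With these assumed numbers the average-case cycle is 4.3 ns against a worst-case clock of 10 ns, which is why a dominant fast path translates directly into throughput for a self-timed design.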
2.3 Asynchronous Microengine
An asynchronous microengine is a microprogrammed self-timed controller which
allows per-microinstruction programmability of its datapath topology by arranging its
datapaths in series and parallel clusters. This feature allows the parallel clusters to run
concurrently while allowing the serial units within a cluster to chain. This chaining
of operations is possible due to explicit acknowledge generation by each asynchronous
datapath in the sequence. Chaining reduces the control overhead as it reduces the number
of microinstructions by combining several microinstructions into a single VLIW type
instruction. This is very hard to implement for synchronous designs as the propagation
delay of serial partitions of combinational modules must add up to be an integral multiple
of the clock period.
2.3.1 Architecture Overview
A conventional (synchronously clocked) microprogrammed control structure consists
of a microprogram store, next address logic, and a datapath. Microinstructions form
commands applied on the datapath and control flow is handled by the next address logic
that, with the help of status signals fed back from the datapath, generates the address
of the next microinstruction to be executed. In a synchronous realization the execution
rate is set by the global clock which must take the worst case delay of all units into
account. When the next clock edge arrives it is thus assumed that the datapath has
finished computing and the next address has been resolved, and the next microinstruction
can be propagated to the datapath. The asynchronous microengines have an organization
similar to those of conventional synchronous microprogrammed controllers. However,
as shown in Figure 2.1, the major difference between the synchronous and asynchronous microprogrammed structures is that the datapaths and the execution control unit communicate through a request-acknowledge handshake instead of being driven in lock-step by a global clock.
Figure 2.1. Microengine’s high level structure
In conventional synchronous microprogrammed controllers, the computation is started
by an arriving clock edge and the datapath is assumed to have completed by the following
clock edge. In the asynchronous case we have no clock to govern the start and end of an
instruction execution. Instead a request is generated to trigger the memory to latch the
new microinstruction and the datapath units to start executing. The memory and the datapath units then signal their completion by generating acknowledges.
2.3.2 Microengine Execution
A microengine starts its execution on receiving an external request from the environ-
ment. The execution control unit in response to this request generates a global request
that latches the first or the specified microinstruction from the microprogram and also
causes the datapaths that are set up for execution to start executing. The microcode
consists of fields that specify which datapath has been set up for execution and in what mode, i.e., sequential or parallel with respect to other datapaths, and multiplexer select
signals which specify the input data and output data for that particular microinstruction.
On completion of computation, the datapaths generate acknowledges which are sent
back to the execution control unit. Meanwhile, the branch detect unit evaluates the
conditional signals generated by the datapath and determines the address of the next
microinstruction to be executed. After gathering the acknowledges, the execution control unit checks whether the done bit (one of the global control fields in the microcode) is high, which indicates that the operation desired by the environment is complete. If so, it sends an external acknowledge signal to the environment and awaits the next invocation; otherwise, it executes the next microinstruction and repeats the global request-generation cycle.
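The execution cycle just described can be sketched as a small behavioral model. This Python sketch is illustrative only (the real control is self-timed hardware, with no software loop), and the field names execute, done, and next are our own shorthand for the corresponding microcode fields:

```python
# Schematic model of the execution control unit's cycle: issue a global
# request, let the selected datapaths run, gather their acknowledges,
# consult the done bit, and either finish or fetch the next instruction.

def run_microprogram(microprogram, datapaths, start_addr=0):
    """Execute microinstructions until one with done=True completes;
    return the list of addresses executed, in order."""
    addr = start_addr
    executed = []
    while True:
        mi = microprogram[addr]          # global request latches this microinstruction
        acks = []
        for name in mi["execute"]:       # datapaths set up for this cycle start work
            datapaths[name]()            # ...and acknowledge on completion
            acks.append(name)
        executed.append(addr)
        assert set(acks) == set(mi["execute"])  # ECU gathers all acknowledges
        if mi["done"]:                   # done bit high: external acknowledge, stop
            return executed
        addr = mi["next"]                # otherwise repeat the request cycle

# Hypothetical two-instruction microprogram over two dummy datapaths.
datapaths = {"alu": lambda: None, "cmp": lambda: None}
prog = [
    {"execute": ["alu"], "done": False, "next": 1},
    {"execute": ["alu", "cmp"], "done": True, "next": None},
]
order = run_microprogram(prog, datapaths)  # → [0, 1]
```

In hardware the "gather acknowledges" step is a synchronizing element (e.g., a C-element tree), not a loop, but the control flow is the same.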
2.3.3 Next Address Generation
The next microinstruction is fetched in parallel to the execution of the current mi-
croinstruction to increase performance and hide control overhead. This approach works
smoothly except when branches are involved. The problem of branch-prediction is solved
by fetching the next microinstruction but not committing it until the address selection is
resolved. The designer programs each branch to be taken or not taken after studying the probability for each case using empirical data. The next-address logic is kept simple: for a branch instruction, the branch-target address is stored in the microinstruction itself, and a delay slot holds either the predicted branch target or the next sequential microinstruction, depending on the branch-prediction strategy.
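The address selection itself reduces to a two-way choice. The following is a hedged behavioral sketch, with our own field names (set_branch, next_addr) standing in for the microcode fields:

```python
# Sketch of next-address selection: the branch target is stored in the
# microinstruction itself, and a select bit chooses between that target
# and the next sequential address once the branch condition is resolved.

def next_address(current_addr, microinstruction, sel_addr):
    """Return the address of the next microinstruction to latch."""
    if microinstruction.get("set_branch") and sel_addr:
        return microinstruction["next_addr"]   # take the stored branch target
    return current_addr + 1                    # fall through sequentially

mi = {"set_branch": True, "next_addr": 7}
assert next_address(3, mi, sel_addr=True) == 7   # branch resolved as taken
assert next_address(3, mi, sel_addr=False) == 4  # falls through sequentially
```

Because this logic is so simple, the fetch of the predicted instruction can overlap the current instruction's execution, with only a commit deferred until the selection resolves.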
2.3.4 Microcode Fields
Figure 2.2 shows the various fields of a microinstruction (not all fields are shown in this figure). The microcode fields control the local operation mode of each datapath
and control the global microprogram flow. Amongst the local control fields, the set-exe
or se bit controls whether the particular datapath unit will execute during the current
request cycle or not and the set-seq or ss bit controls whether it will execute in sequential
(chained) mode or parallel mode. If a datapath always executes in the sequential mode,
then the se can be used to incorporate ss functionality. The set-mux and op fields deter-
mine which operands and operation the given datapath should use. The en bits enable
which registers when there are multiple registers in a datapath, should latch data.
Figure 2.2. Microinstruction format

The following global control fields control the flow of the microprogram. The
set-branch bit specifies whether this microinstruction is a branch instruction and,
if so, which conditional expression the branch tests. The next-addr field specifies
the branch-address. The bra-pred bit specifies the branch-prediction strategy for the
branch instruction, i.e., whether the branch has been predicted taken or not taken. The
eval bit specifies whether it is a conditional branch or an unconditional jump. The
sel-addr bit selects either the next sequential microinstruction or the branch target. If the
microinstruction is not a branch instruction, then the next-addr, bra-pred, eval, and
sel-addr bits are don't-cares. Finally, the done bit specifies whether this is the last
microinstruction; it signals the execution control unit that the microprogram computation
is over and the results are available on the output ports.
For example, in Figure 2.2, two datapaths A and B have their set- signals
shown. In the first microinstruction, only A is set to execute, whereas in the
second microinstruction both A and B execute, and both operate in parallel mode. In the
third microinstruction, both A and B execute, but B executes in sequential mode with
respect to A. This microinstruction is also a branch instruction in which the result of
datapath A evaluates the branch condition, and the branch has been predicted not taken.
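As an illustration of how the global fields might be packed into a microinstruction word, the following Python sketch encodes them as bits above a branch-address field. The field widths and bit positions are invented for the example; they are not taken from the actual 67-bit or 111-bit microinstruction formats:

```python
# Illustrative (hypothetical) encoding of the global control fields of a
# microinstruction; names follow the text, widths are assumptions.

from dataclasses import dataclass

@dataclass
class GlobalFields:
    set_branch: bool   # is this a branch microinstruction?
    next_addr: int     # branch-target address (don't-care if not a branch)
    bra_pred: bool     # prediction: True = predicted taken
    eval_: bool        # True = conditional branch, False = unconditional jump
    sel_addr: bool     # selects sequential (False) vs branch target (True)
    done: bool         # last microinstruction of the microprogram

    def encode(self, addr_bits=4):
        """Pack the fields into an integer, next_addr in the low bits."""
        word = self.next_addr & ((1 << addr_bits) - 1)
        for i, bit in enumerate(
                [self.set_branch, self.bra_pred, self.eval_,
                 self.sel_addr, self.done]):
            word |= int(bit) << (addr_bits + i)
        return word

mi = GlobalFields(set_branch=True, next_addr=7, bra_pred=False,
                  eval_=True, sel_addr=False, done=False)
print(bin(mi.encode()))  # control bits packed above the 4 address bits
```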
2.3.5 Control Units
The global control units for a microengine are:
1. ECU: The Execution Control Unit generates the global request signal upon receiving
an external request. It then synchronizes all datapath acknowledges. It keeps
repeating the generation of the request signal until it sees a high done signal, and
then generates an external acknowledge. The block-level implementation of an
ECU is shown in Figure 2.3.
2. Next Address Unit: This control unit computes the address of the microinstruction
predicted to execute next, at the start of the next execution cycle. Figure 2.4
shows the block-level implementation of this unit. The current value of the sel-addr
bit is the conditional input to the mux that selects the next address. In the case
of a branch mispredict, the BDU asserts the clear signal, which toggles the value
of the sel-addr bit from the memory, and the correct microinstruction is latched in
the next execution cycle.
3. BDU: The Branch Detect Unit determines whether the conditional branch is taken
and asserts the clear signal in the case of a mispredict. Its gate-level implementation
is shown in Figure 2.5.
The datapaths interact with their local handshake control units, also known as RAS
(Request/Acknowledge/Sequence) blocks. These small blocks support a standardized
way of programming the datapath topology based on the current microinstruction's se
and ss bits. The datapath units themselves communicate with their local RAS block
using standard request/acknowledge protocols. This also makes the datapaths modular:
datapath units can easily be replaced without changing any control structures.
The design of the RAS blocks can vary depending upon the datapath. For example, if a
datapath always operates as the first one in a chain, then it does not need a set-sequence
bit associated with it. The RAS block for such a datapath is shown in
Figure 2.6. The complete set of RAS blocks used in this design is shown in Appendix B.
2.4 Internet Protocol (IP)
The Internet Protocol [29, 22, 15] provides every endpoint with an IP address and
forwards IP packets from a source to their destination based on the IP address in the
packet header. When forwarding packets, IP hides the details of link-layer technologies
from the endpoints and provides the abstraction of an unreliable, best-effort, end-to-end
link. In this thesis, we consider IP over the ethernet link layer.
Figure 2.5. Branch detect unit
Figure 2.6. Local RAS block
2.4.1 IP Header
As seen in Figure 2.7, the IP header without options consists of 20 bytes,
and options can add up to 40 more bytes. The various fields of an IP header are
explained below:
4-bit Header Length This field specifies the length of the header in 4-byte words;
since the field is 4 bits wide, the header is limited to 15 words, or 60 bytes. The
header length field allows a router or host to distinguish between header and payload.
Figure 2.7. IP header format
4-bit Version The current IP version is 4 (a.k.a. IPv4). This field allows the upgraded
version 6 (IPv6) to coexist with IPv4.
Type of Service (TOS) This field is composed of a 3-bit precedence field (which is
ignored today), 4 TOS bits, and an unused bit which is always 0. The four TOS bits
are: minimize delay, maximize throughput, maximize reliability, and minimize
monetary cost. Only one of these four bits can be turned on. The TOS feature is
not supported by most TCP/IP implementations today.
Total Length This field specifies the total length of an IP datagram in bytes. Since it is
16-bits long, the longest IP packet is 65,535 bytes long.
Identification It uniquely identifies an IP packet among those that have a given source
and destination address.
3-bit Flags This field supports fragmentation. Only 2 of the 3 bits are significant. One
of the flags is "More fragments": if this flag is set to 1, the destination knows
that more fragments are still arriving, and it waits until it receives a fragment with
this flag set to 0 before reassembling. The other flag is "Don't
fragment": if this flag is set, a router that receives a packet too large
for its next hop must discard it.
Fragment offset This field also supports fragmentation and reassembly. It tells the
destination which part of the original packet is contained in the current fragment.
Time-To-Live This field gets decremented at every hop and thus helps to control the
lifetime of a packet. If it becomes zero, then the packet is discarded.
Protocol This field tells IP to which upper layer protocol it should pass the packet. It is
typically TCP or UDP.
Header Checksum This field protects against corruption of the packet header. IP
does not checksum the data. To compute the checksum for an outgoing packet, the
value of the field is first set to zero. Then the 16-bit one's complement sum of the
entire header is calculated, and this value is stored in the checksum field. When
an IP datagram is received, the 16-bit one's complement sum of the header is calculated.
Since this sum includes the checksum stored by
the sender, the receiver's result is all one bits if nothing in the header was
modified. If the result is not all one bits (a checksum error), IP discards the packet.
32-bit Source IP address This field is used to send a reply to the source of the packet,
for example, when an error message is generated.
32-bit Destination IP address This field is used to route the packet and forward it to its
destination.
Options IP options are not universally implemented in the Internet and are not a
common case. To save packet-processing time, many high-speed routers do not
process options. In this thesis, we do not process options.
2.4.2 IP Over Ethernet Layer
Ethernet [29] is the most commonly used link-layer protocol for LANs and is
frequently used to support a range of network-layer protocols, including IP. IP
datagrams are transmitted by encapsulation in Medium Access Control (MAC) frames. IP
introduces an extra protocol, known as the Address Resolution Protocol (ARP), to map
between the 48-bit destination hardware address in a MAC frame and a 32-bit IP address.
In this thesis, we classify the packets at the ethernet layer as ARP query, ARP reply, and
IP packet, but do not process the ARP packets. As shown in Figure 2.8, the type field is
used for this classification.
Figure 2.8. Ethernet frame format
2.4.3 IP Router
A router [8, 16] is defined as a host that has an interface on more than one network.
Every router along the path has a routing table with at least two fields: a network number
and the interface on which to send packets with that network number. When a router
receives a datagram, it looks up the routing table to determine its next-hop address and
forwards it to the outgoing port. If the destination address is unknown, then it sends
the packet to the default route. IP routing consists of data-plane and control-plane
processing. Data-plane processing comprises packet-processing tasks such as
TTL decrement, checksum verification, checks on other IP header fields such as the
header length, and datagram forwarding via a route lookup. Control-plane processing
consists of generating ICMP error messages and handling the routing protocols, which
decide the best path to reach a destination and make the corresponding routing-table
updates. The IP router implemented in this thesis performs data-plane processing
and does not handle any control-plane processing.
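The data-plane steps can be sketched as a short loop. The routing-table layout and names below are hypothetical and greatly simplified (a prefix match rather than the CAM lookup used later in the thesis):

```python
# Hypothetical sketch of the data-plane steps described above: TTL check and
# decrement, then a route lookup; the table format is invented for illustration.

def forward(packet, routes, default_iface="default"):
    """Return the output interface for `packet`, or None if discarded."""
    if packet["ttl"] <= 1:
        return None                      # TTL expired: discard
    packet["ttl"] -= 1                   # (the checksum would be updated too)
    for network, masklen, iface in routes:
        # compare the top `masklen` bits of the destination with the route
        if packet["dst"] >> (32 - masklen) == network >> (32 - masklen):
            return iface
    return default_iface                 # unknown destination: default route

routes = [(0xC0A80100, 24, "eth0"), (0x0A000000, 8, "eth1")]
pkt = {"dst": 0xC0A80105, "ttl": 4}      # 192.168.1.5
print(forward(pkt, routes))              # -> eth0
```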
2.5 Click Modular Router
Click is a software architecture for building flexible and configurable routers that
was developed at MIT [17]. Applications in Click are built by composing modules called
elements, which perform simple packet-processing tasks like classification, route lookup,
header verification, queuing, scheduling, and interfacing with network devices. A Click
configuration is a directed graph with elements as vertices; packets flow along the edges
of the graph. Click is implemented on Linux using C++ classes to define elements.
Element communication is implemented with virtual function calls to neighboring
elements, and connections are represented as pointers to element objects. These configurations are
modular and easy to extend.
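The push-style composition described above can be sketched with toy elements; the classes below are simplified stand-ins for Click's C++ elements and virtual-call connections, not its actual API:

```python
# Toy sketch of Click-style element composition: each element does a small
# packet task and hands the packet to the element on its output port.

class Element:
    def __init__(self):
        self.outputs = {}
    def connect(self, port, element):
        self.outputs[port] = element     # an edge in the configuration graph
    def push(self, port, packet):
        raise NotImplementedError

class Strip(Element):
    """Drop the first n bytes (e.g., the 14-byte ethernet header)."""
    def __init__(self, n):
        super().__init__()
        self.n = n
    def push(self, port, packet):
        self.outputs[0].push(0, packet[self.n:])

class Sink(Element):
    """Collect packets for inspection (stands in for a device element)."""
    def __init__(self):
        super().__init__()
        self.received = []
    def push(self, port, packet):
        self.received.append(packet)

strip, sink = Strip(14), Sink()
strip.connect(0, sink)
strip.push(0, b"E" * 14 + b"IP payload")
print(sink.received)                     # -> [b'IP payload']
```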
This thesis uses the Click IP router configuration given in [18]. A standard
IPv4-over-ethernet bridge router with two network interfaces has sixteen elements
in its forwarding path. Figure 2.9 shows an IP router's configuration.
This router can be extended to support firewalls, dropping policies, differentiated
services, and other extensions by simply adding a few elements at the respective
places. The router implementation in this thesis is based on the above Click configuration;
each module's functionality was modeled in hardware by studying the fine-grained
packet-processing C++ description of the corresponding element. Since the microengine-based
architecture needs modular datapaths, Click's modular and extensible router suits
the microengine architecture style naturally, and the hardware router too can be extended
by adding more datapaths and modifying the microcode.

The functionality of each element in the forwarding path is briefly described below
in the context of this application:
1. FromDevice(eth0)- It receives packets from the ethernet port and sends them across
its single output port.
2. Classifier(set of rules)- It takes the packet and classifies it as an ARP query, ARP
reply, or an IP packet and sends it to the respective output ports.
3. Paint(p)- It sets the paint annotation of the packet to the port number p if it is an IP
packet. This helps in sending an ICMP redirect message if the packet is sent back
via the same port it arrived on.
4. ARPResponder(ip eth)- Input takes ARP queries and outputs ARP responses. It
responds to ARP queries for IP address ip with the static ethernet address eth.
5. ARPQuerier- Its first input takes an IP packet that is to be encapsulated with
an ethernet header; it finds the destination ethernet address for the corresponding
next-hop destination IP address and encapsulates the IP packet. Its second input
takes ARP responses with ethernet headers. Its first output is the IP-in-ethernet
encapsulated packet, and its second output is an ARP query, issued when the ethernet
address corresponding to some IP address is not known.
6. Strip(14)- It strips the ethernet header off the IP-packet and sends it to the output
port.
19
7. CheckIPHeader()- It takes IP packets as input; it discards packets with invalid IP
length, source address, or checksum fields, and forwards valid packets unchanged.
8. GetIPAddress()- Input takes IP packets and copies the IP header’s destination ad-
dress field into the destination IP address variable for that packet; forwards packets
unchanged.
9. LookupIPRoute()- Input takes IP packets with valid destination address variables.
It has an arbitrary number of outputs. It looks up the input packet's destination
address variable in a static routing table and forwards each packet to the output port
specified in the resulting routing table entry; it sets the packet's destination address
variable to the resulting gateway address, if any.
10. DropBroadcasts()- Input takes any packet, discards packets that arrived as link-level
broadcasts, and forwards others unchanged.
11. CheckPaint(p)- Input takes any packet. A packet whose paint variable equals p is
forwarded to both output ports; other packets go only to the first output. The second
output port is used to send an ICMP redirect message.
12. ICMPError(ip, type, code)- Input takes IP packets; the output emits ICMP error
packets. It encapsulates the first part of the input packet in an ICMP error header with
source address ip, error type type, and error code code, and sets the fix-IP-source
variable for this packet.
13. IPGWOptions()- Input takes IP packets. This element processes the IP record-route
and timestamp options; packets with invalid options are sent to the second output.
14. FixIPSrc(ip)- Input takes IP packets. It sets the IP header's source address field to
the static IP address ip if the packet's fix-IP-source variable is set, and forwards other
packets unchanged.
15. DecIPTTL- Input takes IP packets and decrements the input packet's IP time-to-live
field. If the packet is still live, it incrementally updates the checksum and sends the
modified packet to the first output; if the TTL has expired, it sends the unmodified
packet to the second output.
16. IPFragmenter(MTU)- Input takes IP packets. It fragments IP packets larger than
MTU and sends the fragments, as well as packets smaller than MTU, to the first output.
Packets larger than MTU with the don't-fragment bit set are sent to the second output.
17. ToDevice(eth0)- Input takes ethernet packets for transmission; it has no outputs.
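The incremental checksum update performed by DecIPTTL can be sketched with the standard one's-complement update technique (RFC 1624); the helper names and header values below are illustrative, not the thesis implementation:

```python
# Sketch of an incremental IPv4 checksum update when one 16-bit header word
# changes (here, the word holding the TTL). Example header values invented.

def full_checksum(words):
    """One's complement of the one's-complement sum of 16-bit words."""
    total = sum(words)
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total ^ 0xFFFF

def incr_checksum_update(old_cksum, old_word, new_word):
    """RFC 1624-style update: HC' = ~(~HC + ~m + m')."""
    total = (old_cksum ^ 0xFFFF) + (old_word ^ 0xFFFF) + new_word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total ^ 0xFFFF

# a made-up 20-byte header as ten 16-bit words; word 5 is the checksum,
# word 4 holds (ttl << 8) | protocol
words = [0x4500, 0x0028, 0x0001, 0x0000, 0x4006, 0x0000, 0x0A00, 0x0001,
         0x0A00, 0x0002]
words[5] = full_checksum(words)
old, new = words[4], words[4] - 0x0100          # TTL decremented by one
updated = incr_checksum_update(words[5], old, new)
words[4] = new
# the incremental result matches a full recomputation over the new header
assert updated == full_checksum([w if i != 5 else 0
                                 for i, w in enumerate(words)])
```

The incremental form avoids re-summing the whole header, which is why a router element like DecIPTTL prefers it.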
2.6 Related Work
2.6.1 Programmable Asynchronous Controllers
Programmable asynchronous controllers were explored in the 1980s [28] in the
context of data-driven machines. These machines used a vertical microcode for their
microsequencer, which drove multiple slave controllers, along with structured tiling
that introduced considerable control overhead; the controller was also not
application-specific. In 1997, Jacobson and Gopalakrishnan at Utah looked into the design of
efficient application-specific asynchronous microengines [13, 14, 12]. Their architec-
ture uses a horizontal microcode that allows per-microinstruction programmability of its
datapath topology by arranging its datapath units into series-parallel clusters for each
microinstruction.
2.6.2 Network Processors
Many companies have ventured into network processor design, and hence there is
a wide variety of architectures available in the market. However, all designs have one
key point in common: they use multiple programmable processing cores or engines
(PPEs) on a single chip. For example, the Intel IXP1200 consists of six microengines on
a single die (the similarity between the names of the IXP "microengine" and our
asynchronous microengines is coincidental), but the amount of functionality in these
cores varies from vendor to vendor. Some use RISC cores with an added bit-manipulation
instruction set, also known as ASIPs (Application-Specific Instruction-set Processors),
whereas others use a VLIW (Very Long Instruction Word)-based architecture. In these
architectures, multiple PPEs can process in parallel, in a pipeline, or in a combination of
both styles, depending on the application. However, a RISC-based architecture is more
flexible and easier to configure than a VLIW-based architecture. Many RISC-based
architectures use multithreading in each core to maximize throughput and do useful
work while waiting for an operation to complete; this is particularly useful for hiding
memory access latency. RISC-based network processors also provide dedicated and
specialized hardware or integrated co-processors to perform common network-processing
tasks like encryption, lookup, classification, and CRC computation.
CHAPTER 3
DESIGN AND IMPLEMENTATION
3.1 Router Architecture
The architecture of the IP router consists of two types of microengines: the ingress-processing
microengine, which does packet classification, and the IP-header-processing
microengine, which does minimal processing on the IP header, such as route lookup and
checksum computation. The high-level architecture block diagram is shown in Figure 3.1.
There are two ingress microengines, one for each of ports A and B. Each ingress
microengine has one input FIFO queue and three output FIFO queues. The
input queue contains each incoming packet's id and its ethernet header's memory
address. Upon receiving a request from the input FIFO, the ingress microengine starts
executing, classifies each packet as an ARP reply, ARP query, or IP packet, and sends
the packet to the respective output FIFO. Since there are two FIFOs which contain IP
packets from each ingress microengine and a single IP header processing microengine,
there needs to be a synchronizing element which merges the two IP FIFOs into a single
input FIFO queue for the other microengine. We have implemented an arbiter module
which does this synchronization. The IP header processing microengine consists of a
merged input FIFO and three output FIFO queues, namely discard, portA, and portB.
A packet is sent to the discard FIFO only if it meets a discard criterion, such as an
expired TTL. A packet is routed to the portA or portB FIFO depending upon the next
hop for its destination address.
3.1.1 Assumptions
The focus of this thesis has been on the asynchronous microcoded style used to build a
network-processing application rather than on high-performance IP routers. Since IP
routing was the chosen application, the emphasis was on implementing the microengine-based
controller rather than the supporting circuitry. Thus, many assumptions have been
made with regard to the supporting circuits needed to realize a complete working model.

Figure 3.1. High level architecture of the microengine-based router
1. It has been assumed that a NIC exists that does all the link-layer processing, such as
encapsulating and decapsulating IP packets with the ethernet header.
2. It has been assumed that a port processor exists that handles the memory manage-
ment and performs the following functions:
• Buffers incoming packets into the memory.
• Buffers payload in SDRAM; IP and ethernet layer header in SRAM.
• Maintains the state of each packet by assigning each packet-buffer a packet-
id.
• Passes the header-pointer and packet-id to the ingress microengine.
• Receives the outgoing packet’s id and pointer from the output queue of IP
header processing microengine and sends that packet to the corresponding
output port.
3. The microengine-based router does not handle headers with IP options or fragmented
datagrams, as only routers connected to end hosts need to do this
processing.
4. The packets that need control-plane processing are handed over to some other
processor core. This is a reasonable assumption, as the percentage of control packets
in real network traffic is very small; even in the case of the Intel IXP,
the StrongARM core is responsible for control-plane processing while the RISC
microengine cores handle data-plane processing.
5. Unlike the Click router configuration, this router does not handle ICMP error
generation, as ICMP is also a part of control-plane processing. The packets that
would generate an ICMP error are sent to a discard FIFO queue.
6. A CAM (Content Addressable Memory) has been used to do route lookup. The first
byte of the destination IP address is used to find a match instead of a longest-prefix-matching
algorithm. This optimization has been made to save the limited on-board
Xilinx SRAM.
7. Each microengine has its own copy of the header store. This has been assumed
because the Xilinx Spartan-II library does not have an SRAM with four read ports and
one write port; ASIC vendors like Texas Instruments sell such five-port register files.
Since each microengine reads or writes a different packet, it is ensured that
no two microengines will try to read/write the same memory location.
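The simplified lookup of assumption 6 can be sketched as a small table keyed on the first address byte; the table contents and port names below are invented for illustration:

```python
# Sketch of the simplified CAM route lookup: only the first byte of the
# 32-bit destination address is matched, instead of a longest-prefix match.
# Table contents are hypothetical.

ROUTE_CAM = {0xC0: "portA",   # 192.x.x.x -> port A
             0x0A: "portB"}   # 10.x.x.x  -> port B

def route_lookup(dst_ip):
    """Match only the first byte of the 32-bit destination address."""
    return ROUTE_CAM.get(dst_ip >> 24, "discard")

print(route_lookup(0xC0A80001))  # 192.168.0.1 -> portA
print(route_lookup(0x0A000001))  # 10.0.0.1   -> portB
```

A real router would need the longest-prefix match this shortcut replaces; the one-byte key is purely the SRAM-saving optimization described above.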
3.2 Ingress Microengine
The ingress microengine classifies packets into three types, namely ARP query, ARP
reply, and IP packet. The block diagram of an ingress microengine is shown in
Figure 3.2.
3.2.1 Datapaths
This microengine consists of the following datapaths:
1. 8-bit Address register, which stores the packet's memory address.
2. 8-bit ALU, which calculates the offset address of a particular header field in the
header memory.
3. 16-bit Header register, which is byte addressable.
4. Two 16-bit comparators, which compare header fields against rule values.
5. 2-bit Flow-id register, which stores the packet's classification result.
6. Send-to-fifo datapath, which enables one of the three output FIFOs by sending it
a request signal based on the flow-id's value. The block diagram of a send-to-fifo
datapath is shown in Figure 3.3.
7. Header store, which stores all packet headers.
3.2.2 Microprogram Structure
The microprogram for this microengine consists of four 67-bit-wide microinstructions.
Each microinstruction consists of global control fields, such as the branch
address next-addr, done, etc., and local control fields for each datapath unit, such as
set-execute se. Since flexibility is one of the most important advantages of our
architecture, it is demonstrated by incorporating the classification rules in the microcode
itself; this allows these rules to be upgraded easily without changing the
underlying hardware. The header bytes that need to be compared with some value can
be specified in the microcode itself, along with the rule value.
Figure 3.2. Block diagram of ingress microengine
Figure 3.3. Send-to-fifo datapath
Thus, the microcode for an ingress microengine also consists of the following fields:
• Header byte offset address
• Classification rule value (for example, ethernet header “type” field should be 0800x
for an IP packet)
• Flow id which enables the respective output queue (00 for IP, 10 for ARP reply
and 11 for ARP query)
• Mux select signals for byte-addressable header register
3.2.3 Operation Overview
The input FIFO queue, upon receiving a packet, sends a request to the ingress
microengine which, if in an idle mode, starts executing by propagating a global request
signal and latching the first microinstruction mi-1. In mi-1, the ethernet header "type"
field is read out from the memory, and then the two comparators check in parallel
whether the type
is an IP (0800x) or ARP (0806x). If it is an IP packet, then the flow-id “00” from the
microcode gets selected and latched in the flow-id register and the microengine fetches
the next sequential instruction mi-2. If it is an ARP packet, then the flow-id “01” gets
selected, and the microengine takes the branch and fetches mi-3. If mi-2 is executed,
the send-to-fifo datapath is enabled; it sends the packet out to the IP output FIFO,
and the microengine jumps to mi-4, which is the done instruction. If mi-3 is executed,
the ARP "type" field is read from the memory and the two comparators again
compute in parallel to check whether it is an ARP reply or query. Depending upon the
flow-id chosen by the comparator results, the send-to-fifo datapath sends the ARP packet
to the reply or query output FIFO and fetches the done instruction mi-4. The done
instruction sets the
done bit high, upon which the ingress microengine’s ECU sends an ACK to the input
FIFO and waits for the next packet to arrive on this port. In this microengine, most of the
datapath operations are chained except for the parallel comparator evaluation.
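The classification flow just described can be sketched in software; the function below is an illustrative model of the four-microinstruction flow (the ARP operation values and the handling of unrecognized types are assumptions, not part of the thesis microcode):

```python
# Software model of the ingress classification flow: mi-1 compares the
# ethernet "type" field, mi-2 handles IP, mi-3 resolves ARP query vs reply.

ARP_REQUEST, ARP_REPLY = 1, 2             # assumed ARP operation values

def classify(ether_type, arp_op=None):
    """Return the flow for one packet header (mi-4, done, is implicit)."""
    # mi-1: compare the ethernet "type" field against IP and ARP in parallel
    if ether_type == 0x0800:
        return "IP"                       # mi-2: send to the IP output FIFO
    if ether_type == 0x0806:
        # mi-3: read the ARP operation field and compare again in parallel
        return "ARP query" if arp_op == ARP_REQUEST else "ARP reply"
    return None                           # type not recognized

print(classify(0x0800))                   # -> IP
print(classify(0x0806, ARP_REQUEST))      # -> ARP query
```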
3.3 Header-processing Microengine
This microengine buffers the IP header from the memory into a header register file
and processes it. The IP header is first checked for validity, i.e., for valid "version"
and header length "hlen" fields. The checksum of the incoming packet's IP header is
computed and compared against the "checksum" field. The IP header is also checked
for an expired "ttl" field. If the header is valid, the "ttl" field is decremented, the
next-hop destination IP address is looked up, the checksum is recomputed and its
latest value is written to the checksum field, and the packet is sent to the respective port.
If the header is invalid, the packet is discarded by sending it to a discard FIFO.
3.3.1 Datapaths
The block diagram of an IP-header-processing microengine is shown in Figure 3.4. It
consists mostly of generic datapaths, except for the CAMs, which are specialized for
network-processing tasks.
Figure 3.4. Block diagram of IP header processing microengine
The generic datapaths in this microengine are similar to those in the ingress
microengine and include the 8-bit address register, header memory, 16-bit comparator,
16-bit temporary registers, 8-bit ALU, 16-bit ALU, flow-id register, and send-to-fifo
datapaths. Amongst the specialized datapaths, we have one CAM that does route lookup
and another that implements a stateless firewall. The modularity of datapaths in this
architecture has been demonstrated by adding a filtering CAM (based on the source IP
address) to our design. The filtering extension can be disabled by changing the set-exe
bit of this datapath in the microcode so that it does not execute. Since the entire IP
header (20 bytes without IP options) is stored in this microengine, the register file has
been implemented as a 4-byte-wide and 16-word-deep dual-port SRAM with one
synchronous write port and an asynchronous read port. This allows the writing of new
header bytes into the memory to proceed in parallel with the reading of already-stored
bytes for checks and checksum computation. Except for the CAMs and the dual-port
register file, most of the datapaths are generic and common to both the ingress and
IP-header-processing microengines. Depending on the design requirements (if chip area
is not a constraint), the ingress microengine could be implemented simply by changing
the IP-header-processing microengine's microcode.
3.3.2 Microprogram Structure
There are fifteen 111-bit-wide microinstructions for this microengine. The global
and local control microcode fields are the same as described earlier. This microprogram
has microcode fields similar to the ingress microengine's, such as the flow-id values and
the read and write addresses for header bytes. Since many constant values are compared
to header bytes, such as the version and hlen fields, these constants have not been
incorporated in the microcode and are instead stored in constant registers. However, if
more flexibility is needed due to changes in the protocol, these constants can be added
as microcode fields.
3.3.3 Operation Overview
The microengine receives an IP packet from the input FIFO and begins its execution;
after it finishes processing, it sends the packet header address to one of the three
output FIFOs (portA, portB, and discard). The entire IP-header-processing algorithm
has been implemented in fifteen microinstructions, and there may be opportunities for
further optimization. Of the fifteen microinstructions, nine have two parallel
execution clusters and one microinstruction has three parallel execution clusters. The
first microinstruction mi-1 latches the packet’s memory address in the address register
and then computes the offset address and reads the four header bytes from the packet
memory, and writes them to the dual-port header register, followed by checking the first
byte for validity (i.e., checking the "version" and header length "hlen" fields). If the
check evaluates to true, control jumps to mi-3; otherwise, it falls through to the next
sequential microinstruction mi-2, the packet-discard microinstruction, which sends the
packet's address to the discard FIFO and jumps to the mi-15 (done) microinstruction.
There are three microinstructions in which, if any of the checks fail, control jumps to
the discard microinstruction (mi-2). One of
the interesting features of this algorithm is the way the checksum of an incoming
packet is computed. After the first four bytes are written to the register file, the
two consecutive 16-bit header words (bytes 1-2 and 3-4) are added and stored in a
temporary register. From this point onwards, two header bytes are read from the
asynchronous read port and added to the value stored in the temporary register, in
parallel with the writing of header bytes on the write port. This works because it is
ensured that the read and write addresses are never the same. The temporary register
keeps getting updated. The 16-bit sum of the entire IP header, except for the checksum
and ttl fields, is stored in a separate temporary register. This value can later be used to
calculate the checksum of the outgoing packet by simply adding the latest ttl field value
to it. The writing of the latest ttl and checksum values to the header
memory in two microinstructions has also been handled in this implementation.
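The running-sum technique just described can be sketched as follows; the word positions of the ttl and checksum fields and the sample header values are assumptions for illustration:

```python
# Sketch of the running-sum checksum: 16-bit header words are folded into a
# temporary register as they are written, with the checksum and ttl words
# kept out of the sum so the outgoing checksum needs only the new ttl word.

CKSUM_WORD, TTL_WORD = 5, 4               # assumed positions in the word array

def add1c(a, b):
    """One's-complement 16-bit addition with end-around carry."""
    s = a + b
    return (s & 0xFFFF) + (s >> 16)

def running_sum(words):
    """Accumulate all header words except the checksum and ttl words."""
    temp = 0
    for i, w in enumerate(words):         # one word per 'write' step
        if i not in (CKSUM_WORD, TTL_WORD):
            temp = add1c(temp, w)
    return temp

def outgoing_checksum(partial, new_ttl_word):
    """Fold in the decremented ttl word, then complement."""
    return add1c(partial, new_ttl_word) ^ 0xFFFF

words = [0x4500, 0x0028, 0x0001, 0x0000, 0x3F06, 0x0000, 0x0A00, 0x0001,
         0x0A00, 0x0002]                  # made-up header, ttl already new
partial = running_sum(words)
print(hex(outgoing_checksum(partial, words[TTL_WORD])))  # -> 0x67cd
```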
3.4 Design Methodology
The asynchronous microengine based router architecture has been prototyped on
the Xilinx SpartanII XS2S150 board made by Xess. The design has been validated
with respect to functionality, and complete back-annotated timing analysis has been
performed. The Xilinx ISE tool suite has been used for synthesis, placement, and
routing, and MTI ModelSim for simulation. The 3D tool [34] was initially used for the
synthesis of the execution control unit, but it did not give a correct specification, so the
circuit was then modified by hand. The block diagram of the ingress microengine is
shown in Figure 3.2. A mixed design flow has been used for prototyping this design.
The microcode stores and CAMs have been implemented as Xilinx IP cores using the
Logicore tool. A few of the datapaths and control units have been designed using a
two-phase macromodular approach, while the rest of the datapaths have been specified
in behavioral VHDL. The final assembly has been done in the ISE ECS schematic
capture environment. These various design flows are explained in the following
subsections.
3.4.1 Macro-modular Design Approach
Self-timed macro-modules have been used for designing a few of the control and
datapath elements [3]. To accomplish this, a subset of two-phase control modules for
the Xilinx FPGA was designed, consisting of the following: C-element, transition
latch, select, q-select, toggle and two-way call element. The C-element was coded in
behavioral VHDL, whereas the rest of the elements were designed using Xilinx compo-
nents in the ISE ECS schematic editor. These macromodules have been used to build
self-timed flow-through FIFOs [30]. Since this design uses a four-phase handshaking
protocol, protocol converters have been used to interface these macro-module-based
two-phase designs with the four-phase implementation. The Send-to-fifo datapath has
been designed using these modules (Select). Our arbiter circuit is based on a q-select ring
alt construct, which also uses these modules, as shown in Figure 3.5.
3.4.2 VHDL-based Design Approach
Most of the datapaths, such as the ALUs, comparators and registers, and control units, such
as all the RAS blocks, have been specified in behavioral VHDL. These VHDL specifi-
cations have been verified functionally using the ModelSim simulator. Most of these modules
are not optimized for high performance. Various high-speed circuit-design
styles could offer much better performance were this design not implemented on
an FPGA.
Figure 3.5. Alt-ring conditional construct
3.4.3 Memory Design Using Xilinx Core Generator
The microcode stores, header store, dual-port header register and CAMs [33] have
been designed using Xilinx cores. These cores are optimized for high performance and
can be easily integrated into a VHDL, Verilog or schematic-based design flow. These
cores give users the flexibility to define features such as the memory depth and width, the
choice of RAM or ROM, and the type of on-chip memory resource to be used (distributed
SelectRAM or block RAM). The Xilinx LogiCORE tool (CORE Generator) helps specify
the initial contents of memory using a memory editor tool. The memory editor
allows a user to specify the initial contents in an array format, which is far easier and
more flexible than specifying the contents through INIT attributes. For a microcoded
framework, where the low-level microcode can be error prone, such a tool helps reduce
the design time.
3.4.4 Bundled Delay vs. Completion Detection
All datapaths except the send-to-fifo datapath use a bundled delay to meet the bundling
constraint instead of completion detection circuitry. To measure the bundled delay re-
quirement, the worst-case execution time of each datapath had to be taken into account.
These numbers were taken from the post-route timing information obtained after the
datapaths were synthesized, mapped and routed using the ISE tool. The use of a bundled
delay for completion sensing takes away the asynchronous advantage of average-case
completion time, but it limits the scope of this thesis in the actual datapath design and
allows faster prototyping using VHDL. The bundled delay has been implemented using
buffer chains. One problem initially encountered with this method was that the
FPGA synthesizer optimized away all the buffers as redundant logic. This problem was
solved by using Keep attributes on all the signals associated with these buffers.
3.5 FPGA Resource Statistics
This design was too large to fit on the existing Xilinx XS2S100 Spartan-II board.
The entire design has instead been simulated using the XS2S150 Spartan-II board's timing
library. Since this board was not available, the performance numbers have been measured
using back-annotated timing analysis. The FPGA usage statistics are given below:
• Logic utilization — total number of slice registers: 1,144 out of 3,456 (33%)
• Logic distribution — number of occupied slices: 1,726 out of 1,728 (99%)
• Total number of 4-input LUTs: 1,941 out of 3,456 (56%)
1. Number used as logic: 1,719
2. Number used as route-through: 8
3. Number used as distributed RAM: 214
• Number of block RAMs: 9 out of 12 (75%)
• Total equivalent gate count for the design: 176,166. A logic gate on average consists
of 4 to 6 transistors. Thus, an ASIC version of our design would have 6 × 176,166,
i.e., about 1.05 million transistors. Although a million transistors seems quite high
given that only generic datapaths are used, most of the transistor budget
goes into the on-chip memory. The microcode store has used up approximately 7,000
gates. The CAMs have utilized a total of 35,392 logic gates (0.212 million transistors).
Each copy of the packet store is made out of 32,768 logic gates (0.192 million tran-
sistors). Since we have three copies of the packet store, the total number of logic gates
utilized by packet stores is 98,304 (0.528 million transistors). If we assume
only a single copy of the packet store with multiple read and write ports, then our
total gate count is 111,130 (0.666 million transistors), out of which 85,658 gates
(0.513 million transistors) constitute memory logic.
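As a back-of-envelope check of the transistor estimates above, using the upper end of the 4-6 transistors-per-gate range (a sketch, not an extraction from the tools):

```python
TRANSISTORS_PER_GATE = 6  # upper end of the 4-6 range assumed above

def transistors(gates):
    # Equivalent gate count converted to an approximate transistor count.
    return gates * TRANSISTORS_PER_GATE

print(transistors(176_166))  # whole design: 1,056,996, i.e. ~1.05 million
print(transistors(35_392))   # CAMs: 212,352, i.e. ~0.212 million
```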
CHAPTER 4
EVALUATION
This architecture has been evaluated based on the execution time of each microin-
struction, the execution time of each microengine, and a comparison of an extrapolated
ASIC version with a similar Click-based router configuration. As this architecture is
based on many assumptions, the throughput analysis may not be very realistic. A complete
power consumption analysis for the FPGA prototype has not been done, as the power
consumption of an FPGA is high and it is difficult to extrapolate the result to an ASIC
version of the same design. However, asynchronous microengines are expected to have low
power consumption, as is characteristic of asynchronous architectures [2, 23, 9], and also
because this architecture style has fine-grain clock gating.
4.1 Evaluation of Async Advantages
4.1.1 Demonstration of Flexibility
As seen in Figure 4.1, there are three parallel chains of datapath execution in
the 6th microinstruction of the microcode. By enabling the set-exe bits of the CAM filter and
flow-id in execution chain 3, this architecture can be extended to support firewalling.
The global control bits were also modified in the 6th microinstruction, since a conditional
branch microinstruction was evaluated (i.e., if the packet's source IP address is a blocked
IP address, then the microengine needs to discard it and jump to the discard microin-
struction). To enable firewalling, a total of 10 bits in the 111-bit-wide microinstruction
were modified. Upon modification of the microcode, no change in the architecture's
per-packet performance was observed: execution chain 1 is the longest chain, and
enabling or disabling the other execution chains (2 and 3) has no impact on the
execution time. Thus, it has been demonstrated that this architecture can be extended by
modifying the microcode.

Figure 4.1. Datapath execution in the 6th microinstruction of the IP header processing
microengine
4.1.2 Demonstration of Average-case Completion Times
Each microinstruction's execution time has been measured as the time taken by the
global request's handshake. Long execution times for microinstructions indicate long
chains of datapath executions. Tables 4.1 and 4.2 show the execution times for each
microinstruction for both microengines. Figures 4.2 and 4.3 show the execution times
for all microinstructions.
4.1.2.1 Synchronous Version of Microengine
There are two ways to extrapolate a synchronous version of an asynchronous microengine-
based design:
Table 4.1. Microinstruction execution times of an ingress microengine

Microinstruction   Execution time (ns)
1                  130.80
2                   72.77
3                  165.38
4                   39.69
Table 4.2. Microinstruction execution times for IP header processing microengine

Microinstruction   Execution time (ns)
1                  142.23
2                   68.83
3                  155.46
4                  109.15
5                  109.15
6                  109.15
7                  112.46
8                  109.15
9                  102.53
10                 100.00
11                  99.23
12                  79.38
13                 115.77
14                  92.61
15                  62.84
Figure 4.2. Microinstruction execution times of ingress microengine
• Taking the worst-case microinstruction execution time to be the clock period as the
clock signal needs to wait for the longest datapath execution chain to complete.
• Taking the worst-case execution time of the slowest datapath as the clock period.
Figure 4.3. Microinstruction execution times of IP header processing microengine

The second approach may make the synchronous counterpart look worse, as there would
be more, shorter microinstructions and more control overhead associated with fetch-
ing each microinstruction. Also, this would not be an exact counterpart of an asyn-
chronous microengine, which chains datapaths and has fewer microinstructions. Hence,
the first approach has been chosen for this evaluation. Conservatively, a synchronous
implementation of this design will have a clock period equal to the maximum microin-
struction execution time plus 10%. Thus, the synchronous version of the ingress micro-
engine will have a clock period of 181.9 ns, and that of the IP header processing micro-
engine will have a clock period of 170.5 ns. Tables 4.3 and 4.4 show total execution times
for various packet types for both the asynchronous and synchronous microengine-based
implementations. The asynchronous version of the ingress microengine demonstrates
average-case completion time compared to the synchronous version. In the case of the
IP header processing microengine, the async implementation performs better than the
synchronous version except in the case of a correct IP packet. The asynchronous IP header
processing microengine is slower than its synchronous counterpart because:
• In a four-phase protocol, no useful work is done during the return-to-zero
part of the handshake. In each microinstruction, the return-to-zero part takes
up 68-70 ns due to the unoptimized (VHDL-described) C-element being
used (this C-element needs to synchronize 8 Ack signals). As the number
of microinstructions executed increases, the additive effect of the return-to-zero part
increases.
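The synchronous extrapolation behind Tables 4.3 and 4.4 reduces to simple arithmetic, sketched below using the ingress microengine's numbers from Table 4.1 (the function names are illustrative):

```python
def sync_clock_period(microinstruction_times_ns, margin=0.10):
    # Clock period = worst-case microinstruction time + 10% margin.
    return round(max(microinstruction_times_ns) * (1 + margin), 1)

def sync_packet_time(n_microinstructions, period_ns):
    # In the synchronous version, every microinstruction costs one full period.
    return round(n_microinstructions * period_ns, 1)

period = sync_clock_period([130.80, 72.77, 165.38, 39.69])  # Table 4.1 -> 181.9 ns
print(sync_packet_time(3, period))  # IP packet: 545.7 ns (Table 4.3)
print(sync_packet_time(5, period))  # ARP packet: 909.5 ns (Table 4.3)
```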
Tables 4.3 and 4.4 show the execution times during the active phase of the handshake for
the asynchronous implementation. There are many techniques to hide this delay that have
not been implemented here; they are discussed in the next section [12]. The average-case
Table 4.3. Ingress microengine per-packet execution time

Packet   µ-instructions   Async version        Async active-phase    Sync version
type     executed         execution time (ns)  execution time (ns)   execution time (ns)
IP       3                353.97               243.26                545.7
ARP      5                514.32               448.64                909.5
Table 4.4. IP header processing microengine per-packet execution time

Packet type                    µ-instructions   Async version        Async active-phase    Sync version
                               executed         execution time (ns)  execution time (ns)   execution time (ns)
IP with incorrect ver, hlen     4                516.00               429.36                682
Filtered IP                     8               1309.85               869.27               1534.5
IP with expired ttl            10               1653.85              1080.95               1875.5
IP whose route lookup failed   14               2434.53              1499.11               2387
Correct IP                     14               2460.17              1499.11               2387
completion-time advantage of this approach has been evaluated at the level of microcode
instruction execution; it can be further improved by using completion-sensing
datapaths [7, 4, 20, 24] instead of bundled delays.
4.1.3 Power Consumption Analysis
In this thesis work, a complete power consumption measurement has not been done,
as FPGA power consumption is quite high and cannot be extrapolated to that of
an ASIC. To evaluate the low-power advantage of this design approach, the
fine-grain clock gating feature of asynchronous microengines has been demonstrated. In
an asynchronous microengine, only the datapaths that have been set to execute during
a given microinstruction actually run. Hence, power consumption is lower, as not all
datapaths execute during each microinstruction execution cycle. Tables 4.5 and 4.6 show
how many datapaths execute in each microinstruction execution cycle for both the ingress
and IP header processing microengines.
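One rough way to quantify the clock-gating benefit is the average fraction of datapaths active per microinstruction, computed from Tables 4.5 and 4.6 (a sketch of the bookkeeping only, not a power model):

```python
# Datapaths executing per microinstruction, from Tables 4.5 and 4.6.
ip_active = [6, 1, 5, 5, 5, 7, 5, 4, 2, 3, 3, 3, 5, 4, 0]  # out of 11
ingress_active = [7, 2, 8, 0]                               # out of 9

def activity_factor(active_counts, total_datapaths):
    # Average fraction of datapaths that actually execute.
    return sum(active_counts) / (len(active_counts) * total_datapaths)

print(round(activity_factor(ip_active, 11), 2))      # ~0.35
print(round(activity_factor(ingress_active, 9), 2))  # ~0.47
```

In other words, roughly a third to a half of the datapaths are exercised on average; a clocked design without gating would toggle all of them every cycle.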
4.2 Improving Asynchronous Microengines
The existing implementation of the microengines is not optimized for
performance. The microinstruction is latched only when the ECU has finished synchro-
nizing all datapaths, and sufficient time must then be given for the microinstruction to
propagate to the datapaths and set them up along with their RAS blocks. This introduces
significant control overhead, on the order of at least 20 ns per microinstruction execution
cycle. Also, the ECU has to wait for the longest datapath thread to finish execution before
starting the next cycle, which introduces significant computational overhead. Methods
suggested in [12] to improve the performance of the unoptimized microengine are
summarized below:
Table 4.5. The number of datapaths executing in the IP header processing microengine

Microinstruction   Datapaths executing (out of 11)
1                  6
2                  1
3                  5
4                  5
5                  5
6                  7
7                  5
8                  4
9                  2
10                 3
11                 3
12                 3
13                 5
14                 4
15                 0
Table 4.6. The number of datapaths executing in the ingress microengine

Microinstruction   Datapaths executing (out of 9)
1                  7
2                  2
3                  8
4                  0
• Reducing control overhead: In the four-phase protocol design, the control over-
head can be reduced by latching the next microinstruction on the falling edge
of the global request, during the return-to-zero phase. This allows the datapaths
and RAS blocks to be set up while the ECU performs synchronization. It helps
hide the wasteful return-to-zero part of a four-phase protocol and can greatly
improve performance.
• Reducing computational overhead: The microengine has to wait for the longest
thread to finish execution before beginning the next execution cycle. Long-
latency operations can thus block the concurrent thread datapaths
that finish executing earlier. This computational overhead can be consid-
erably reduced by decoupling the datapaths. This technique allows
nondecoupled datapaths to execute the next microinstruction while decoupled
datapaths are still executing the previous one. This optimization in-
volves considerable changes to the design of the ECU and RAS blocks. An extra
set-decouple microcode field is introduced for each datapath. The ECU no longer
needs to wait for all acknowledgements and resynchronizes with decoupled
datapaths only when it needs to. The circuit-level details are explained
in [12].
With respect to the FPGA implementation of this architecture, the C-element that
has been used is not optimized. If a C-element with a fan-in of 8 is used instead of
connecting 4 C-elements with a fan-in of 2, the synchronization time is reduced by 7 ns
per synchronization. This makes the asynchronous version of the IP header microengine
perform the same as the synchronous version for the longest sequence of microinstruction
executions. Also, this C-element has been designed as a gate-level VHDL description,
owing to the FPGA implementation, instead of as a fast transistor-level circuit.
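For reference, the C-element's behavior can be modeled as follows (a behavioral sketch only; the actual element is a gate-level VHDL description):

```python
class CElement:
    # Muller C-element: the output goes high only when all inputs are high,
    # low only when all inputs are low, and otherwise holds its value.
    def __init__(self, fan_in):
        self.fan_in = fan_in
        self.out = 0

    def update(self, inputs):
        assert len(inputs) == self.fan_in
        if all(inputs):
            self.out = 1
        elif not any(inputs):
            self.out = 0
        return self.out

# A single fan-in-8 element synchronizes all 8 Ack signals in one stage,
# whereas composing fan-in-2 elements adds gate delays per tree level.
c = CElement(8)
print(c.update([1] * 8))            # all Acks high -> 1
print(c.update([1] * 4 + [0] * 4))  # mixed inputs -> holds 1
print(c.update([0] * 8))            # all Acks low -> 0
```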
4.3 Performance Evaluation
A microengine's per-packet execution time is given by the time taken for the external
request's handshake to complete. The execution time varies from packet to packet:
if any of the IP checks fail, the microengine executes fewer microinstructions
and, hence, the execution time is reduced. Tables 4.3 and 4.4 show the execution times
for each microengine for all the possible packet cases. For evaluation purposes, we
consider the total execution time of the ingress microengine for classifying an IP
packet to be 353.97 ns, and the execution time of the IP header processing microengine for
processing an IP packet that gets sent to one of the output ports to be 2460.17 ns.
4.3.1 Comparison with Click Software Router
As a starting point, this router's performance has been compared with a similar Click
software router configuration [18]. Per-packet execution times for each of the Click
software's elements running on 700 MHz Pentium III PC hardware are shown in Table 4.7.
The reason for choosing a Pentium III instead of a gigahertz system for this evaluation
is the availability of the published numbers in the Click paper [18].
The ingress microengine implements the Classifier and Paint elements. The IP header
processing microengine implements the Strip, CheckIPheader, GetIPaddress, Lookup-
IPRoute and DecIPttl elements.

Table 4.7. Click elements' per-packet execution time

Element         Time (ns)
Classifier          70
Paint               77
Strip               67
CheckIPheader      457
GetIPaddress       120
LookupIPRoute      140
DecIPttl           119

This evaluation is based on an approximate implementation of these Click elements, as
details such as the number of route-table entries in their elements are not known. Using
the execution numbers from the table, it can be
seen that the ingress microengine's equivalent Click router has a per-packet execution
time of 147 ns, which is 2.4x better than the ingress microengine. Similarly, the IP header
processing microengine's functionally equivalent Click router has a per-packet execution
time of 903 ns, which is 2.7x better than the microengine implementation. However,
this implementation is an FPGA prototype, and on a Spartan-II part at that, which has
lower performance than newer high-performance FPGA parts. Hence, extrapolating these
FPGA prototype performance results to an ASIC version gives a more realistic
comparison.
4.3.1.1 Extrapolation of FPGA Results to an ASIC Version
First, a 0.5u SCMOS-process standard cell library [10] of generalized asynchronous
microengine control and datapath modules was used for performance comparison against
the Spartan-II FPGA modules. A few blocks, such as the ECU and the asynchronous RAM,
were implemented on the Xilinx board, and their timing was compared with the available
standard cell library. The average execution-time speedup observed is approximately
5x, which factors in the I/O pin delay. This speedup factor is very conservative, as
this FPGA-to-ASIC extrapolation does not take into account the interconnect speedup,
which would be much higher than 5x. Since the Spartan-II family is designed in a 0.25u,
5LM process technology and the given SCMOS implementation is in a 0.5u process, an
additional speedup of 2x was factored in to compare the two implementations in the same
process technology (0.25u). This assumption of 2x speedup is based
on the constant field scaling model [31], which states that if the feature size of a CMOS
process is scaled down by α, then the gate delay speeds up by approximately α. Thus, the
performance speedup factor for FPGA-to-ASIC conversion is 10x. The total per-packet
processing time for a correct IP packet taken by an ASIC version of the IP header processing
microengine would roughly be 246.01 ns, and that of the ingress microengine would
roughly be 35.39 ns. The asynchronous microengine-based router's performance is 3.9x
Click software's performance running on a 700 MHz Pentium III (0.18u). The
reason the results from the Click configuration running on a 700 MHz Pentium III machine
were used, instead of those from a present-day gigahertz machine, is that these were
validated and published results. Although Click's performance will improve on a gigahertz
machine, this design's performance will also scale with better process technologies
and design techniques. Assuming a linear scaling factor based on the constant field scaling
model, the microengine-based router's performance is 5.1x Click software's performance
in the same 0.18u process technology. Since this work has been implemented on an FPGA,
high-speed transistor-based circuit design techniques were not used. By using better
processes and design techniques in a custom CMOS implementation, there is room for
considerably more improvement.
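The extrapolation arithmetic above can be summarized in a short sketch, using the factors assumed in the text (a conservative 5x FPGA-to-ASIC speedup, 2x constant-field scaling from 0.5u to 0.25u, and linear scaling to 0.18u):

```python
FPGA_TO_ASIC = 5.0          # conservative FPGA-to-ASIC speedup
PROCESS_0_5_TO_0_25 = 2.0   # constant-field scaling, 0.5u -> 0.25u

def asic_time_ns(fpga_time_ns):
    # Combined 10x speedup applied to the measured FPGA timing.
    return fpga_time_ns / (FPGA_TO_ASIC * PROCESS_0_5_TO_0_25)

ip_asic = asic_time_ns(2460.17)      # ~246 ns for a correct IP packet
ingress_asic = asic_time_ns(353.97)  # ~35.4 ns for classification

# Linear scaling to Click's 0.18u process, then compare with Click's 903 ns:
ip_018 = ip_asic * (0.18 / 0.25)     # ~177 ns
print(round(903 / ip_018, 1))        # ~5.1x Click's per-packet time
```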
4.3.2 Comparison with Intel’s IXP1200 Network Processor
In this subsection, the per-packet processing time of the extrapolated ASIC version
is compared with that of Intel's IXP1200. This network processor was chosen for
comparison because its performance as a router has been evaluated at Princeton
University [27]. The Intel IXP1200 [11] has six RISC cores, a.k.a. microengines, which run
at 200 MHz, along with a StrongARM core. The data-plane processing is handled by
the microengines, whereas the control-plane processing is done by the StrongARM. The IXP
microengines are able to do minimal IP forwarding at a rate of 3.47 Mpps, where each
packet has a size of 64 bytes. The total number of cycles spent in forwarding by a
single microengine, including memory access latency, adds up to 710 cycles (each cycle
takes 5 ns). The memory latency in the case of the IXP is hidden by multiple contexts,
and the system as a whole is able to output a packet every 288 ns. The paper [27] also
mentions that for doing just IP forwarding, the per-packet processing time is 32 cycles
of register instructions, which is equivalent to 160 ns (32 × 5 ns). However, the authors
mention that their minimal IP forwarder only does ttl decrement, checksum
recomputation and Ethernet header replacement. The other functions, like the IP header
check, which includes the version and checksum checks, are done by their classifier.
However, no results are given for the classifier operation. Also, in this thesis the Ethernet header
replacement has not been implemented. It is very difficult to get an exact performance
number for the tasks being performed in the microengine-based router and to
compare them to the tasks done by the IXP microengines. It is also not clear which
process technology was used for this IXP evaluation: in 1999, the IXP1200 was based
on a 0.28u process, and a little later a 0.18u process was used. The microengine-based
router (ASIC version, 0.18u process) takes 180 ns for doing ttl decrement, checksum check
and recomputation, IP header checks, stateless firewalling and route lookup. If the ASIC
version of the microengine-based router is assumed to be in a 0.28u process, the IP header
processing takes 275.5 ns and the packet classification takes 39.63 ns. In
terms of per-packet performance, this thesis shows promising results.
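The cycle and process-scaling arithmetic in this comparison is easy to re-derive (a sketch; the scaling factors come from linear constant-field scaling between feature sizes):

```python
CYCLE_NS = 5  # IXP1200 microengines run at 200 MHz

print(32 * CYCLE_NS)        # minimal forwarding: 160 ns of register instructions
print(710 * CYCLE_NS)       # full forwarding per microengine: 3550 ns
print(round(1e9 / 3.47e6))  # one packet every ~288 ns at 3.47 Mpps

# Linear process scaling of the ASIC-version numbers (from 0.25u):
print(round(246.01 * 0.18 / 0.25, 1))  # ~177 ns in 0.18u ("roughly 180 ns")
print(round(246.01 * 0.28 / 0.25, 1))  # ~275.5 ns in 0.28u
print(round(35.39 * 0.28 / 0.25, 1))   # ~39.6 ns classification in 0.28u
```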
4.4 Throughput Analysis
In the steady state, the maximum throughput of a network device is given by the
processing time of the slowest element. The maximum loss-free throughput of the
microengine-based router (FPGA version) is 406,504 34-byte packets per second, whereas
that of the Click router is 333,000 64-byte packets per second. The extrapolated ASIC
version of our implementation would have a throughput of 4,065,040 34-byte packets per
second.
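These throughput figures follow directly from the slowest element's per-packet time (a sketch; the FPGA number uses the 2460.17 ns figure from Table 4.4, so it lands close to, but not exactly on, the rounded value quoted above):

```python
def max_throughput_pps(slowest_element_time_ns):
    # Steady state: one packet per pass through the slowest element.
    return 1e9 / slowest_element_time_ns

print(round(max_throughput_pps(2460.17)))       # FPGA version: ~406,500 pps
print(round(max_throughput_pps(2460.17 / 10)))  # 10x ASIC extrapolation: ~4,065,000 pps
```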
CHAPTER 5
CONCLUSIONS AND FUTURE WORK
In this thesis, a case for asynchronous microengines in the network processing do-
main has been presented. This asynchronous microengine-based approach tries to fill
the performance gap between a specialized ASIC and a more general network processor
implementation. It does this by providing a microcoded framework that is close in
performance to ASICs and is also programmable at the finer granularity of microcode.
These asynchronous microengines can be used to design an asynchronous network
processor architecture that takes advantage of the modularity, flexibility, high-performance
and low-power aspects of an asynchronous design approach.
5.1 Future Work
• As part of future work, completion sensing can be added to the datapaths. This
would give an average-case performance advantage not just at the microinstruction
level but also at the datapath-execution level.
• Another way to increase this design's performance would be to implement tech-
niques that reduce the control overhead in the return-to-zero phase of the four-phase
handshake.
• Implementing an ASIC version would allow more detailed performance
and power measurements.
• It would be interesting to look into microcode generation tools, as the present
practice of hand coding the binary microcode is error-prone.
• Measuring the router's maximum throughput by revisiting the assumptions
and driving the router with real-world network traffic, at least in simulation.
APPENDIX A
MICROCODE
A.1 Ingress Microengine Microcode
The microcode for our Ingress Microengine consists of 4 micro-instructions, each 67 bits
wide. The local control fields corresponding to each datapath and the global control
fields in the microcode are explained below.
1. Address and Paint Register: This datapath is always the first to execute in any of
the execution chains and has only a set-exe bit associated with it. Microcode for
the address and paint register is shown in Table A.1.
2. 8-bit ALU: This datapath has a set-exe bit, a set-seq bit and an 8-bit operand
value (the offset for the memory address) in the microcode, as shown in Table A.2.
3. Offset-address Register: This datapath has been added to prevent a setup-time
violation of the packet memory datapath. It has a set-seq bit associated with it,
as it always executes in sequence with the 8-bit ALU. The microcode is shown in
Table A.3.
4. Packet memory: The packet memory always executes in sequence and does not
need a separate set-exe bit. The microcode for this datapath also has memory
Table A.1. Address register's microcode

µ-instruction   set-exe
1               1
2               0
3               0
4               0
Table A.2. 8-bit ALU's microcode

µ-instruction   set-exe   set-seq   8-bit Operand
1               1         1         0x03
2               0         0         0x00
3               1         0         0x04
4               0         0         0x00
Table A.3. Offset register's microcode

µ-instruction   set-exe
1               1
2               0
3               1
4               0
related control signals such as the write-enable (we) bit and the enable (en) bit. There
is a mux select signal (sel-add) for selecting the address value to be read by the
memory (the value from the address register, or the address value after the offset has
been added to it). The microcode is shown in Table A.4.
5. Header Register: This datapath has a set-seq bit associated with it and an input
mux signal (sel) for selecting either the most significant or the least significant
16 bits of the packet header. The microcode is shown in Table A.5.
6. 16-bit Comparator A: This datapath has a set-seq bit associated with it along with a
16-bit operand value and a 2-bit rule-id field. The microcode is shown in Table A.6.
Table A.4. Packet memory's microcode

µ-instruction   set-seq   we   en   sel-add
1               1         0    1    1
2               0         0    0    0
3               1         0    1    1
4               0         0    0    0
Table A.5. Header register's microcode

µ-instruction   set-seq   sel
1               1         0
2               0         0
3               1         0
4               0         0
Table A.6. 16-bit comparator A's microcode

µ-instruction   set-seq   2-bit Rule-id   16-bit Operand
1               1         00              0x0800
2               0         00              0x0000
3               1         10              0x0001
4               0         00              0x0000
7. 16-bit Comparator B: This datapath has a similar set of microcode fields to com-
parator A, and its microcode is shown in Table A.7.
8. Flow-id Register: This datapath has both the set-exe and the set-seq fields associ-
ated with it and its microcode is shown in Table A.8.
9. Send-to-fifo: This datapath always executes in sequence with the flow-id register,
has only the set-seq field, and its microcode is shown in Table A.9.
10. The global control fields of this microengine consist of the eval bit, the branch-pred
bit, the sel-add bit, the 4-bit micro-instruction address and the done bit. The microcode
for the global control is shown in Table A.10.
Table A.7. 16-bit comparator B's microcode

µ-instruction   set-seq   2-bit Rule-id   16-bit Operand
1               1         01              0x0806
2               0         00              0x0000
3               1         11              0x0002
4               0         00              0x0000
Table A.8. Flow-id register's microcode

µ-instruction   set-exe   set-seq
1               0         0
2               1         0
3               1         1
4               0         0
Table A.9. Send-to-fifo's microcode

µ-instruction   set-seq
1               0
2               1
3               1
4               0
Table A.10. Global control microcode

µ-instruction   eval   branch-pred   sel-add   4-bit Operand   done
1               1      0             0         0x2             0
2               0      0             1         0x3             0
3               0      0             0         0x0             0
4               0      0             1         0x0             1
A.2 IP-Header Processing Microengine
The microcode for our IP-header processing microengine has 15 micro-instructions
that are 111 bits wide. In this microcode, we have not added constant values as operand
fields, using constant registers instead. This was done to save limited
on-chip memory resources. The local control fields corresponding to each datapath and
the global control fields in the microcode are explained below.
1. Address and Paint Register: This datapath is always the first to execute in any
of the execution chains and thus has only a set-exe bit associated with it and its
microcode is shown in Table A.11.
Table A.11. Microcode for IP microengine's address and paint registers

µ-instruction   set-exe
1               1
2               0
3               0
4               0
5               0
6               0
7               0
8               0
9               0
10              0
11              0
12              0
13              0
14              0
15              0
2. 8-bit ALU: This datapath has a set-exe bit, a set-seq bit with respect to the address
register, a 2-bit opcode, a set-mux bit for selecting one of the inputs, and the 8-bit operand
value (the offset for the memory address) in the microcode, as shown in Table A.12.
3. Packet header store: This datapath's microcode consists of a set-seq field with respect
to the 8-bit ALU, write-enable fields corresponding to the two block RAMs, and an
enable field (en), as shown in Table A.13.
4. Header Register File: This datapath has been implemented using four dual-port
synchronous RAMs and has the widest microcode. The microcode consists of
set-exe and set-seq fields and 4-bit-wide read and write port addresses for all four
blocks (each block has a byte-wide data in and out port). It also has set-mux fields
for selecting the inputs and outputs. The microcode for the header register file is
shown in Tables A.14 and A.15.
5. 16-bit ALU: This datapath has the set-exe field, the set-seq field with respect to
the header register file, and a 2-bit and a single-bit input set-mux field, as shown in
Table A.16.
Table A.12. Microcode of IP microengine’s 8-bit ALU
µ-instruction   se   ss-add   2-bit opcode   sm   8-bit Operand
1               1    1        00             0    0x03
2 0 0 00 0 0x00
3 1 0 00 0 0x04
4 1 0 00 0 0x05
5 1 0 00 0 0x06
6 1 0 00 0 0x07
7 1 0 00 0 0x08
8 0 0 00 0 0x00
9 0 0 00 0 0x00
10 1 0 01 1 0x01
11 0 0 00 0 0x00
12 0 0 00 0 0x00
13 1 0 00 0 0x05
14 1 0 00 0 0x06
15 0 0 00 0 0x00
Table A.13. Microcode of IP microengine’s packet store
µ-instruction   ss-8bit-ALU   we1   we2   en
1               1             0     0     1
2               0             0     0     0
3               1             0     0     1
4               1             0     0     1
5               1             0     0     1
6               1             0     0     1
7               1             0     0     1
8               0             0     0     0
9               0             0     0     0
10              0             0     0     0
11              0             0     0     0
12              0             0     0     0
13              1             0     1     1
14              1             1     0     1
15              0             0     0     0
Table A.14. Microcode of IP microengine’s header register file
µ-insn   se   ss-packet   ss-tempreg   we(1:4)   wr-add-b1   wr-add-b2   wr-add-b3   wr-add-b4
1 1 1 0 0xC 0x0 0x0 0x0 0x0
2 0 0 0 0x0 0x0 0x0 0x0 0x0
3 1 1 0 0xF 0x1 0x1 0x0 0x0
4 1 1 0 0xF 0x2 0x2 0x1 0x1
5 1 1 0 0xF 0x3 0x3 0x2 0x2
6 1 1 0 0xF 0x4 0x4 0x3 0x3
7 1 1 0 0x3 0x0 0x0 0x4 0x4
8 0 0 0 0x0 0x0 0x0 0x0 0x0
9 0 0 0 0x0 0x0 0x0 0x0 0x0
10 0 0 0 0x0 0x0 0x0 0x0 0x0
11 0 0 0 0x0 0x0 0x0 0x0 0x0
12 1 0 0 0x8 0x2 0x0 0x0 0x0
13 1 0 1 0x3 0x0 0x0 0x2 0x2
14 0 0 0 0x0 0x0 0x0 0x0 0x0
15 0 0 0 0x0 0x0 0x0 0x0 0x0
6. 16-bit Comparator: This datapath has set-mux fields for input operands, a 2-bit
opcode, set-exe and set-seq field with respect to the header register file and its
microcode is shown in Table A.17.
7. 16-bit temporary registers: This datapath has two temporary registers (A and B),
each with a separate enable field; they share the set-seq field. The microcode is
shown in Table A.18.
8. Route Look-up CAM, Filter CAM and Send-to-Fifo: These CAMs each have a set-exe
field. The lookup CAM has additional fields: write enable (we) and enable (en)
for the look-up route table block RAM. The Send-to-fifo datapath has the set-exe
field and a set-seq field with respect to the flow-id register datapath. The microcode
for these datapaths is shown in Table A.19.
Table A.15. Microcode of IP microengine’s header register file contd.
µ-insn   sm-in1   sm-in2   sm-in3   read-add-b1   read-add-b2   read-add-b3   read-add-b4   sm-out(1 to 4)
1 0 0 0 0x0 0x0 0x0 0x0 0xF
2 0 0 0 0x0 0x0 0x0 0x0 0x0
3 0 0 0 0x0 0x0 0x0 0x0 0x3
4 0 0 0 0x1 0x1 0x0 0x0 0x0
5 0 0 0 0x0 0x0 0x1 0x1 0x0
6 0 0 0 0x3 0x3 0x0 0x0 0x0
7 0 0 0 0x0 0x0 0x3 0x3 0x0
8 0 0 0 0x2 0x0 0x4 0x4 0x0
9 0 0 0 0x4 0x4 0x0 0x0 0x0
10 0 0 0 0x2 0x2 0x0 0x0 0x0
11 0 0 0 0x4 0x0 0x2 0x2 0x0
12 1 0 0 0x0 0x0 0x0 0x0 0x0
13 0 1 1 0x2 0x2 0x0 0x0 0x0
14 0 0 0 0x0 0x0 0x2 0x2 0x0
15 0 0 0 0x0 0x0 0x0 0x0 0x0
Table A.16. Microcode of IP microengine’s 16-bit ALU
µ-instruction  se  ss-header  2-bit sm-in1  sm-in2
1              0   0          00            0
2              0   0          00            0
3              1   1          00            1
4              1   0          10            0
5              1   0          10            1
6              1   0          10            0
7              1   0          10            1
8              1   0          10            1
9              1   0          10            0
10             1   0          10            0
11             1   0          10            1
12             0   0          00            0
13             1   0          11            0
14             0   0          00            0
15             0   0          00            0
Table A.17. Microcode of IP microengine’s 16-bit comparator
µ-instruction  se  ss-header  3-bit sm-in1  3-bit sm-in2  2-bit sm-in3  sm-in4  2-bit opcode
1              1   1          100           000           11            0       01
2              0   0          000           000           00            0       00
3              0   0          000           000           00            0       00
4              0   0          000           000           00            0       00
5              0   0          000           000           00            0       00
6              0   0          000           000           00            0       00
7              0   0          000           000           00            0       00
8              1   0          100           000           00            0       10
9              0   0          000           000           00            0       00
10             0   0          000           000           00            0       00
11             0   0          000           000           00            0       00
12             1   0          101           100           10            1       00
13             0   0          000           000           00            0       00
14             0   0          000           000           00            0       00
15             0   0          000           000           00            0       00
Table A.18. Microcode of IP microengine’s 16-bit temporary registers
µ-instruction  ss-16bitALU  enA  enB
1              0            0    0
2              0            0    0
3              1            1    0
4              1            1    0
5              1            1    0
6              1            1    0
7              1            1    0
8              1            1    0
9              1            1    1
10             1            1    0
11             1            1    0
12             0            0    0
13             1            1    0
14             0            0    0
15             0            0    0
Table A.19. Microcode of IP microengine’s CAMs and send-to-fifo datapaths
µ-instruction  CAM-filter se  CAM-lookup se  CAM-lookup en  CAM-lookup we  send-to-fifo se  send-to-fifo ss-flow
1              0              0              0              0              0                0
2              0              0              0              0              1                0
3              0              0              0              0              0                0
4              0              0              0              0              0                0
5              0              0              0              0              0                0
6              1              0              0              0              0                0
7              0              0              0              0              0                0
8              0              0              0              0              0                0
9              0              0              0              0              0                0
10             0              0              0              0              0                0
11             0              1              1              0              0                0
12             0              0              0              0              0                0
13             0              0              0              0              0                0
14             0              0              0              0              1                1
15             0              0              0              0              0                0
9. Flow-id Register: This datapath has a set-exe field, a set-seq field, set-mux fields
for the inputs, and flow-id rules. Its microcode is shown in Table A.20.

10. Global Control Fields: These fields determine the next microinstruction to be
fetched and are the same as those for the Ingress Processing Microengine. Their
microcode is shown in Table A.21.
Table A.20. Microcode of IP microengine’s flow-id register
µ-insn  se  ss-cmp16  ss-camF  sm-in1  sm-in2  sm-in3  2-bit Flow1  2-bit Flow2
1       1   1         0        0       1       0       00           01
2       0   0         0        0       0       0       00           00
3       0   0         0        0       0       0       00           00
4       0   0         0        0       0       0       00           00
5       0   0         0        0       0       0       00           00
6       1   0         1        0       0       1       00           01
7       0   0         0        0       0       0       00           00
8       1   1         0        0       1       0       00           01
9       0   0         0        0       0       0       00           00
10      0   0         0        0       0       0       00           00
11      0   0         0        0       0       0       00           00
12      1   1         0        0       1       0       00           01
13      0   0         0        0       0       0       00           00
14      1   0         0        1       0       0       00           01
15      0   0         0        0       0       0       00           00
Table A.21. Microcode of IP microengine’s global control
µ-instruction  eval-cmp16  branch-pred-cmp16  sel-cmp  eval-camF  sel-add  4-bit add  done
1              1           1                  0        0          1        0x2        0
2              0           0                  0        0          1        0xE        0
3              0           0                  0        0          0        0x0        0
4              0           0                  0        0          0        0x2        0
5              0           0                  0        0          0        0x0        0
6              0           0                  0        1          0        0x1        0
7              0           0                  0        0          0        0x0        0
8              1           1                  1        0          0        0x1        0
9              0           0                  0        0          0        0x0        0
10             0           0                  0        0          0        0x0        0
11             0           0                  0        0          0        0x0        0
12             1           1                  1        0          0        0x1        0
13             0           0                  0        0          0        0x0        0
14             0           0                  0        0          0        0x0        0
15             0           0                  0        0          0        0x0        1
APPENDIX B
SOURCE CODE
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_signed.all;
use IEEE.std_logic_arith.all;

entity celement is
  port (clear, a, b : in STD_LOGIC;
        c : inout STD_LOGIC);
end celement;

architecture behv of celement is
begin
  process (clear, c, a, b)
  begin  -- process clr
    if clear = '1' then
      c <= '0';
    elsif a = '1' and b = '1' then
      c <= '1';
    elsif a = '0' and b = '0' then
      c <= '0';
    else
      c <= c;
    end if;
  end process;
end behv;
Figure B.1. Source code of a 2-input c-element
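For readers without a VHDL simulator, the C-element's truth behavior can be cross-checked against a small behavioral model. The Python sketch below is an illustration only (the function name is invented here, not part of the thesis sources): the output goes high when both inputs are high, low when both are low or clear is asserted, and otherwise holds its previous value.

```python
def c_element(clear, a, b, prev):
    """Behavioral model of the 2-input C-element with clear.

    Mirrors the VHDL process in Figure B.1: clear dominates,
    matching inputs drive the output, and disagreeing inputs
    hold the previous output value.
    """
    if clear:
        return 0  # asynchronous clear forces the output low
    if a and b:
        return 1  # both inputs high -> output high
    if not a and not b:
        return 0  # both inputs low -> output low
    return prev   # inputs disagree -> state-holding behavior
```

It is this state-holding case that lets the C-element synchronize two independent handshake events.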
library IEEE;
use IEEE.std_logic_1164.all;

entity BDU is
  port (eval_camf, eval_cmp16, matchf : in std_logic;
        cmp16_out, braped_cmp16 : in std_logic;
        clear : out std_logic);
end BDU;

architecture behv of BDU is
  signal branch : std_logic;
begin
  branch <= (eval_camf and matchf) or
            (eval_cmp16 and cmp16_out);
  clear <= branch xor braped_cmp16;
end behv;
Figure B.2. Source code of IP-header processing microengine’s BDU
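The BDU's branch-misprediction check is purely combinational, so it can be restated as a two-line boolean model. The Python sketch below is illustrative (the function name is mine; the argument names follow the VHDL ports): it computes the actual branch outcome from whichever datapath is being evaluated and raises clear when that outcome disagrees with the prediction.

```python
def bdu_clear(eval_camf, matchf, eval_cmp16, cmp16_out, braped_cmp16):
    """Model of the IP microengine's branch detection unit (BDU).

    The actual branch outcome comes from the filter CAM or the
    16-bit comparator, whichever is being evaluated; clear goes
    high when it differs from the predicted outcome.
    """
    branch = (eval_camf and matchf) or (eval_cmp16 and cmp16_out)
    return int(bool(branch)) ^ int(braped_cmp16)
```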
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity next_add is
  port (req, sel_addr, extack, clear, reset : in std_logic;
        branch_addr : in std_logic_vector(3 downto 0);
        --inc_addr_reg : inout std_logic_vector(3 downto 0);
        next_addr : out std_logic_vector(3 downto 0));
end next_add;

architecture behv of next_add is
  signal int_nxt_addr, inc_addr, inc_addr_reg
    : std_logic_vector(3 downto 0);
begin  -- behv

  process (int_nxt_addr)
  begin  -- process
    inc_addr <= int_nxt_addr + 1;
  end process;

  process (inc_addr, clear, req, extack)
  begin  -- process
    if (extack = '1') then
      inc_addr_reg <= "0000";
    else
      if (req'event and (req = '1')) then
        if (clear = '0') then
          inc_addr_reg <= inc_addr;
        end if;
      end if;
    end if;
  end process;

  process (inc_addr_reg, branch_addr, sel_addr)
  begin  -- process
    if (sel_addr = '0') then
      int_nxt_addr <= inc_addr_reg;
    else
      int_nxt_addr <= branch_addr;
    end if;
  end process;
Figure B.3. Source code of next-address logic
  process (reset, int_nxt_addr)
  begin  -- process
    if (reset = '1') then
      next_addr <= "0000";
    else
      next_addr <= int_nxt_addr;
    end if;
  end process;

end behv;
Figure B.4. Source code of next-address logic contd.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity mi_reg is
  port (a : in std_logic_vector(110 downto 0);
        req, clear, sel_addr : in std_logic;
        reset : in std_logic;
        q : inout std_logic_vector(110 downto 0));
end mi_reg;

architecture rtl of mi_reg is
begin
  process (reset, req)
  begin
    if reset = '1' then
      q <= (others => '0');  -- all 111 bits cleared
    else
      if rising_edge(req) then
        if clear = '1' then
          -- branch mispredicted: squash the fetched microinstruction,
          -- keeping only the branch-address field and toggled sel-addr
          q(110 downto 6) <= (others => '0');
          q(5) <= sel_addr;
          q(4 downto 1) <= q(4 downto 1);
          q(0) <= '0';
        else
          q <= a;
        end if;
      end if;
    end if;
  end process;
end rtl;
Figure B.5. Source code of IP-header processing microengine’s microinstruction register
This module toggles the value of the sel-addr field if clear goes high, i.e., if the
branch was mispredicted.
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_signed.all;
use IEEE.std_logic_arith.all;

entity toggle_sel_addr is
  port (clear : in STD_LOGIC;
        sel_addr_out : out STD_LOGIC;
        sel_addr_in : in STD_LOGIC);
end toggle_sel_addr;

architecture behv of toggle_sel_addr is
begin
  process (clear, sel_addr_in)
  begin  -- process
    if (clear = '1') then
      sel_addr_out <= not sel_addr_in;
    else
      sel_addr_out <= sel_addr_in;
    end if;
  end process;
end behv;
Figure B.6. Source code of sel-addr toggle module
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity add_reg is
  port (add_req : in std_logic;
        add_in : in std_logic_vector(7 downto 0);
        address : out std_logic_vector(7 downto 0));
end add_reg;

architecture behv of add_reg is
begin
  process (add_req, add_in)
  begin
    if rising_edge(add_req) then
      address <= add_in;
    end if;
  end process;
end behv;
Figure B.7. Source code of the address register
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;

entity ALU8 is
  port (alu8_ack : in STD_LOGIC;
        op : in STD_LOGIC_VECTOR(1 downto 0);
        alu8_out : out STD_LOGIC_VECTOR(7 downto 0);
        micro_val : in STD_LOGIC_VECTOR(7 downto 0);
        alu8_in : in STD_LOGIC_VECTOR(7 downto 0));
end ALU8;

architecture behv of ALU8 is

  signal result : STD_LOGIC_VECTOR(7 downto 0);

begin
  process (op, alu8_in, micro_val)
  begin  -- process
    if op = "00" then
      result <= alu8_in + micro_val;
    elsif op = "01" then
      result <= alu8_in - micro_val;
    elsif op = "10" then
      result <= alu8_in and micro_val;
    else
      result <= alu8_in or micro_val;
    end if;
  end process;

  process (alu8_ack)
  begin  -- process
    if rising_edge(alu8_ack) then
      alu8_out <= result;
    end if;
  end process;
end behv;
Figure B.8. Source code of the 8-bit ALU
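The 8-bit ALU's opcode map (00 add, 01 subtract, 10 and, 11 or) can be mirrored in a short Python model. This is an illustrative sketch, not part of the thesis sources; results are masked to 8 bits to match the std_logic_vector width of the datapath.

```python
def alu8(op, alu8_in, micro_val):
    """Model of the 8-bit ALU in Figure B.8.

    op: 2-bit opcode (0b00 add, 0b01 subtract, 0b10 and, 0b11 or).
    """
    if op == 0b00:
        result = alu8_in + micro_val
    elif op == 0b01:
        result = alu8_in - micro_val
    elif op == 0b10:
        result = alu8_in & micro_val
    else:
        result = alu8_in | micro_val
    return result & 0xFF  # truncate to the 8-bit datapath width
```

Note that the mask also gives the hardware's wraparound behavior for add and subtract.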
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;

entity ALU16 is
  port (alu16_ack : in STD_LOGIC;
        alu16_out : out STD_LOGIC_VECTOR(15 downto 0);
        ain : in STD_LOGIC_VECTOR(15 downto 0);
        bin : in STD_LOGIC_VECTOR(15 downto 0));
end ALU16;

architecture behv of ALU16 is

  signal result : STD_LOGIC_VECTOR(15 downto 0);

begin
  process (ain, bin)
  begin  -- process
    result <= ain + bin;
  end process;

  process (alu16_ack)
  begin  -- process
    if rising_edge(alu16_ack) then
      alu16_out <= result;
    end if;
  end process;
end behv;
Figure B.9. Source code of the 16-bit ALU
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_unsigned.all;
use IEEE.std_logic_arith.all;

entity cmp16 is
  port (cmp16_ack : in STD_LOGIC;
        cmp16_out : out STD_LOGIC;
        op : in STD_LOGIC_VECTOR(1 downto 0);
        a : in STD_LOGIC_VECTOR(15 downto 0);
        b : in STD_LOGIC_VECTOR(15 downto 0));
end cmp16;

architecture behv of cmp16 is

  signal result : STD_LOGIC;

begin
  process (a, b, op)
  begin  -- process
    case (op) is
      when "00" =>
        if (a = b) then
          result <= '1';
        else
          result <= '0';
        end if;
      when "01" =>
        if ((a(7 downto 4) = b(7 downto 4)) and
            (a(3 downto 0) <= b(3 downto 0))) then
          result <= '1';
        else
          result <= '0';
        end if;
Figure B.10. Source code of the 16-bit comparator
      when "10" =>
        if (a > b) then
          result <= '1';
        else
          result <= '0';
        end if;
      when others =>
        result <= '0';
    end case;
  end process;

  process (cmp16_ack)
  begin  -- process
    if rising_edge(cmp16_ack) then
      cmp16_out <= result;
    end if;
  end process;
end behv;
Figure B.11. Source code of the 16-bit comparator contd
This RAS block produces the datapath request only if set-exe is set and the global
request has arrived. It produces the seq-ack only if set-exe is set and after it has
received the datapath acknowledge.
library IEEE;
use IEEE.std_logic_1164.all;

entity RAS_ADDPID is
  port (req : in std_logic;
        se : in std_logic;
        add_ack : in std_logic;
        add_req : out std_logic;
        sack : out std_logic);
end RAS_ADDPID;

architecture behv of RAS_ADDPID is
  signal se_inv : std_logic;
begin  -- behv
  add_req <= req and se;
  se_inv <= not se;
  sack <= add_ack;
end behv;
Figure B.12. Source code of the RAS block with set-exe
This RAS block produces the datapath request under the following conditions:
a) set-exe is set and the global request has arrived;
b) if set-seq is set, then it must also wait for the seq-req to arrive.
It produces the seq-ack only when set-exe is set and it has received the datapath ack.
It produces the Ack signal as follows:
a) if set-exe is set, it waits for the datapath ack;
b) if set-exe is not set, it produces the Ack as soon as the global req arrives.
library IEEE;
use IEEE.std_logic_1164.all;

entity ras_alu8 is
  port (req : in std_logic;
        se : in std_logic;
        alu8_ack : in std_logic;
        alu8_req : out std_logic;
        sack_alu8 : out std_logic;
        ss_add : in std_logic;
        sack_add : in std_logic;
        ack1 : out std_logic);
end ras_alu8;

architecture behv of ras_alu8 is
  signal se_inv, ss_add_inv : std_logic;
begin  -- behv
  ss_add_inv <= not ss_add;
  alu8_req <= (req and se and (ss_add_inv or sack_add));
  se_inv <= not se;
  sack_alu8 <= alu8_ack;
  ack1 <= (se_inv and req) or alu8_ack;
end behv;
Figure B.13. Source code of the RAS block with set-exe and set-seq
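The conditions above reduce to three boolean equations. The Python model below is an illustration only (signal names follow the VHDL ports): the datapath request fires only when set-exe is set, the global request has arrived, and, if set-seq is set, the predecessor's seq-ack has been received.

```python
def ras_alu8(req, se, ss_add, sack_add, alu8_ack):
    """Model of the RAS block with set-exe and set-seq (Figure B.13).

    Returns (alu8_req, sack_alu8, ack1).
    """
    # Fire the datapath request only when enabled and, if sequenced
    # with respect to the address datapath, after its seq-ack.
    alu8_req = req and se and ((not ss_add) or sack_add)
    # Pass the datapath ack through as this stage's seq-ack.
    sack_alu8 = alu8_ack
    # Ack immediately when the datapath is not enabled; otherwise
    # wait for the datapath acknowledge.
    ack1 = ((not se) and req) or alu8_ack
    return int(bool(alu8_req)), int(bool(sack_alu8)), int(bool(ack1))
```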
This RAS block generates the datapath req when its set-seq bit is set and it has
received both the seq-req and the global request. It generates the seq-ack only when
set-seq is set and it has received the datapath ack. It generates the Ack when set-seq
is set and it has received the datapath ack, or, if set-seq is not set, upon arrival of
the global req.
library IEEE;
use IEEE.std_logic_1164.all;

entity ras_packet is
  port (ss_alu8, sreq_alu8, req, packet_ack : in std_logic;
        ack2, packet_req, sack_packet : out std_logic);
end ras_packet;

architecture behv of ras_packet is
  signal ss_inv : std_logic;
begin  -- behv
  ss_inv <= not ss_alu8;
  packet_req <= (ss_alu8 and sreq_alu8 and req);
  ack2 <= ((ss_inv and req) or packet_ack);
  sack_packet <= packet_ack;
end behv;
Figure B.14. Source code of the RAS block with set-seq
This RAS block is similar to the one with a single set-seq bit and set-exe, except
that it must wait for the seq-req corresponding to each set-seq bit that is set.
library IEEE;
use IEEE.std_logic_1164.all;

entity ras_flow is
  port (ss_cmp16, ss_camf, se, sreq_cmp16,
        sreq_camf, req, flow_ack : in std_logic;
        ack6, flow_req, sack_flow : out std_logic);
end ras_flow;

architecture behv of ras_flow is
  signal se_inv, ss_camf_inv, ss_cmp16_inv : std_logic;
begin  -- behv
  se_inv <= not se;
  ss_camf_inv <= not ss_camf;
  ss_cmp16_inv <= not ss_cmp16;
  flow_req <= ((ss_camf_inv or sreq_camf) and
               (ss_cmp16_inv or sreq_cmp16) and se and req);
  ack6 <= ((se_inv and req) or flow_ack);
  sack_flow <= flow_ack;
end behv;
Figure B.15. Source code of the RAS block with multiple set-seq bits and set-exe
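With multiple set-seq bits, each bit that is set contributes one more seq-req the block must wait for; a cleared bit makes its term vacuously true. The Python model below is an illustrative sketch (port names follow the VHDL) of that conjunction:

```python
def ras_flow(ss_cmp16, ss_camf, se, sreq_cmp16, sreq_camf, req, flow_ack):
    """Model of the RAS block with multiple set-seq bits (Figure B.15).

    Returns (flow_req, ack6, sack_flow).
    """
    # Each set set-seq bit must have its seq-req present before the
    # datapath request can fire.
    flow_req = (((not ss_camf) or sreq_camf) and
                ((not ss_cmp16) or sreq_cmp16) and se and req)
    # Ack immediately when the datapath is not enabled; otherwise
    # wait for the datapath acknowledge.
    ack6 = ((not se) and req) or flow_ack
    sack_flow = flow_ack
    return int(bool(flow_req)), int(bool(ack6)), int(bool(sack_flow))
```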
REFERENCES
[1] A. Nowatzyk, G. Aybay, and F. Pong. Design of the S3MP processor. In Proc. of
Europar Workshop, 1995.

[2] C. van Berkel, M. Josephs, and S. Nowick. Scanning the technology: applications
of asynchronous circuits. In Proc. of IEEE, volume 87, pages 234–242, February
1999.

[3] E. Brunvand. Translating Concurrent Communicating Programs into Asynchronous
Circuits. PhD thesis, Carnegie Mellon University, 1991.

[4] F. C. Cheng, S. H. Unger, and M. Theobald. Self-timed carry-lookahead adders.
IEEE Transactions on Computers, 49(7):659–672, 2000.

[5] D. E. Comer. Network Systems Design Using Network Processors. Prentice Hall,
2003.

[6] A. Davis and S. M. Nowick. An introduction to asynchronous circuit design.
Technical Report UUCS-97-013, School of Computing, University of Utah, Sept.
1997.

[7] J. Escriba and J. A. Carrasco. Self-timed Manchester chain carry propagate adder.
Electronics Letters, 32(8):708–710, 1996.

[8] C. Partridge et al. A 50-Gb/s IP router. IEEE/ACM Transactions on Networking,
6(3), June 1998.

[9] S. B. Furber, J. D. Garside, and S. Temple. Power-saving features in Amulet2e. In
Proc. of Power Driven Microarchitecture Workshop, June 1998.

[10] G. Gulati and E. Brunvand. Design of a cell library for asynchronous microengines.
In Proc. of Great Lakes VLSI, 2005.

[11] Intel IXP12XX product line of network processors. http://www.intel.com/design/network/products/npfamily/ixp1200.htm.

[12] H. Jacobson and G. Gopalakrishnan. Application-specific asynchronous microengines
for efficient high-level control. Technical Report UUCS-97-007, School of
Computing, University of Utah, 1997.

[13] H. Jacobson and G. Gopalakrishnan. Asynchronous microengines for high-level
control. In Proc. of 17th Conf. on Advanced Research in VLSI (ARVLSI 97), 1997.

[14] H. Jacobson and G. Gopalakrishnan. Application-specific programmable control
for high-performance asynchronous circuits. In Proc. of IEEE, special asynchronous
issue, volume 92, February 1999.

[15] S. Keshav. An Engineering Approach to Computer Networking. Addison Wesley,
1999.

[16] S. Keshav and R. Sharma. Issues and trends in router design. IEEE Communications
Magazine, 36(5):144–151, May 1998.

[17] E. Kohler. The Click Modular Router. PhD thesis, Dept. of Computer Science,
MIT, 2000.

[18] E. Kohler, R. Morris, B. Chen, J. Jannotti, and M. F. Kaashoek. The Click modular
router. ACM Transactions on Computer Systems, 18(3):263–297, August 2000.

[19] J. Kuskin, D. Ofelt, et al. The Stanford FLASH multiprocessor. In Proc. of 21st
Annual International Symposium on Computer Architecture, pages 302–313, 1994.

[20] A. J. Martin. Asynchronous datapaths and the design of an asynchronous adder.
Formal Methods in System Design, 1(1):119–137, July 1992.

[21] C. Myers. Asynchronous Circuit Design. Wiley, 2001.

[22] L. Peterson and B. Davie. Computer Networks: A Systems Approach. Morgan
Kaufmann, 1999.

[23] P. A. Riocreux, L. E. M. Brackenbury, M. Cumpstey, and S. B. Furber. A low-power
self-timed Viterbi decoder. In Proc. of ASYNC, pages 15–24, March 2001.

[24] O. Salomon and H. Klar. Self-timed fully pipelined multipliers. In IFIP Transactions:
Computer Science and Technology, volume A-28, pages 45–55, 1993.

[25] C. Seitz. "System Timing," Chapter 7 of Introduction to VLSI Systems (Mead and
Conway). Addison Wesley, 1980.

[26] N. Shah. Understanding Network Processors. Master's thesis, Dept. of EECS,
University of California, Berkeley, September 2001.

[27] T. Spalink, S. Karlin, L. Peterson, and Y. Gottlieb. Building a robust software-based
router using network processors. In Proc. of the 18th ACM Symposium
on Operating Systems Principles (SOSP), pages 216–229, Chateau Lake Louise,
Banff, Alberta, Canada, October 2001.

[28] K. Stevens. The soft-controller: A self-timed microsequencer for distributed
parallel architectures. Technical report, School of Computing, University of Utah,
1984.

[29] W. R. Stevens. TCP/IP Illustrated, Volume 1: The Protocols. Addison-Wesley,
1994.

[30] I. E. Sutherland. Micropipelines. Communications of the ACM, 32(6):720–738,
June 1989.

[31] N. Weste and D. Harris. Principles of CMOS VLSI Design: A Systems Perspective
(3rd Edition). Addison-Wesley, 2005.

[32] Xilinx 2.5V Spartan-II FPGA Family: Complete Datasheet. http://direct.xilinx.com/bvdocs/publications/ds001.pdf/.

[33] Xilinx Application Note: An Overview of Multiple CAM Designs in Virtex Family
Devices. http://xilinx.com/bvdocs/appnotes/xapp201.pdf.

[34] K. Y. Yun. Synthesis of Asynchronous Controllers for Heterogeneous Systems. PhD
thesis, Stanford University, 1994.