External DDR2-Constrained NOC-Based 24-Processors MPSOC
Design and Implementation on Single FPGA
Abstract – The Network on Chip (NoC) has been proposed as the connection substrate for multiprocessor Systems on Chip (SoC) due to the limited bandwidth of bus-based solutions. Although some designs are emerging, actual design experiences with NoC-based multiprocessor systems on chip remain scarce compared with simulation-based studies. However, implementation constraints clearly affect the design and modelling of a complex multiprocessor. In this paper we present the design and implementation of a 24-processor multiprocessor system under the constraint of limited access to 4 external DDR2 memory banks. All the processors and DDR2 memories are connected to a network on chip through Open Core Protocol (OCP) interfaces. The multiple clock domains resulting from varying IP complexities require a Globally Asynchronous Locally Synchronous (GALS) design methodology, which adds some extra area. The multiprocessor system is fully implemented on a Xilinx Virtex-4 FX140 FPGA-based board and uses about 90% of the chip area.
I. Introduction
Multiprocessor Systems-on-Chip (MPSoCs) have become the standard for implementing embedded systems. Conventional interconnects, such as buses and crossbars, cannot satisfy MPSoC requirements of performance, area, scalability, and reliability. The On-Chip Network (OCN), or Network on Chip (NoC), has been proposed as a systematic approach to this communication-centric design challenge. The modular structure of a NoC makes a multiprocessor architecture highly scalable and improves the reliability and operating frequency of on-chip modules. Furthermore, the NoC approach offers matchless opportunities for Globally Asynchronous, Locally Synchronous (GALS) design, which makes clock distribution and timing closure problems more manageable. To reduce time-to-market pressure and tackle the increasing complexity of SoCs, the need for fast prototyping and testing is growing. Taking advantage of deep submicron technology, modern FPGAs provide fast and low-cost prototyping with large logic resources and high performance. Designers can quickly implement their systems with the IP library provided by the FPGA manufacturer. In this paper we present a multiprocessor system with twenty-four processing elements (PEs) and four DDR2 memory banks. All the PEs and memory controllers are connected by a scalable OCN developed with the Arteris Danube library. The rest of the paper is organized as follows. In Section II we describe our design and technology constraints, how they affect the architecture of the MPSoC to be designed, and the overall architecture with details of the PE and OCN connection. Section III discusses the GALS clock strategy. The results of evaluation and implementation are shown in Section IV. Lastly, we conclude.
II. Architecture
A. External Constraints
The multiprocessor to be designed has to target a single FPGA chip with external pin connection constraints. These pin constraints reduce the possible number of external DDR modules to 4.
Fig. 1 MPSOC Target External Connections
These MPSOC external connections must be implemented under the constraint of a flip-chip BGA package with around 750 maximum I/Os on Xilinx Virtex-4 FX technology. In this family of chips, the FF1517 package is the closest match.
Zhoukun Wang Omar Hammami ENSTA ParisTech
32 Boulevard Victor 75739 Paris, France.
{zhoukun.wang, [email protected]}
Table 1 – Xilinx Virtex-4 Flip-chip packages
B. Overall Architecture
Under the constraint of 4 external DDR2 banks, several MPSOC architectures are possible, assuming that we use Xilinx MicroBlaze processors on the FX140 chip. About 24 processors can be implemented, which can be organized in the architectures described below.
Fig. 2 Block diagram of 3 MPSOC architectures
Architecture (a), described in Fig. 2, is organized as a classical dance-hall architecture with memory banks on one side of the NoC and processors on the other. A NoC is essential in this case, as 24 processors are connected to the communication medium. Architecture (b) organizes the MPSOC around the DDR2 memory banks, under the strong assumption of locality of references and computation, with 8 processors attached to each DDR2 bank. Accesses to each DDR2 bank go through a high-performance shared bus connected to the DDR2 for local accesses and to a NoC for distributed memory accesses. Finally, the last architecture is based on a central NoC with distributed memory banks. It is clear that each architecture is sensitive to layout constraints. The Virtex-4 architecture adds an additional feature, described in the following figure.
Fig. 3 Virtex-4 Internal Architecture
The central column places the DCMs for the clock tree and the I/Os in the center of the chip. Any IP not accessing the I/Os, such as the processor IPs, should then be placed at the periphery of the layout, while IPs accessing the I/Os, such as the DDR2 controllers, should be placed in the center. This eliminates architectures (b) and (c).
Fig. 4 Possible MPSOC Layout

C. Selected Architecture
Fig. 5 MPSOC Architecture – details
The block diagram of the overall multiprocessor architecture is illustrated in Fig. 5. The multiprocessor system comprises 24 Processing Elements (PEs), which can independently run their own program code and operating system. These MicroBlaze-based PEs are connected to switches through OCP-to-NTTP Network Interface Units (NIs). The OCP-to-NTTP NI, also called the Master NI, translates OCP to our OCN protocol, the Arteris NoC Transaction and Transport Protocol (NTTP). The switching system is connected to four NTTP-to-OCP NIs (Slave NIs), which in turn connect to the respective DDR2 memory controllers. Each DDR2 controller controls an off-chip DDR2 memory bank (256 MBytes).
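The paper does not specify how the global address space is distributed over the four banks; as an illustration only, the sketch below assumes the four 256 MByte banks are mapped contiguously from address 0, so a request address selects the Slave NI and the offset within its bank.

```python
# Illustrative sketch only: the bank boundaries below are assumptions,
# not the actual memory map of the implemented system.
BANK_SIZE = 256 * 1024 * 1024  # 256 MByte per external DDR2 bank
NUM_BANKS = 4                  # four DDR2 controllers behind the NoC

def decode_bank(addr: int) -> int:
    """Return the DDR2 bank index for a 32-bit global address,
    assuming the four banks are mapped contiguously from 0."""
    bank = addr // BANK_SIZE
    if bank >= NUM_BANKS:
        raise ValueError("address outside the 1 GB DDR2 region")
    return bank

def bank_offset(addr: int) -> int:
    """Byte offset inside the selected bank."""
    return addr % BANK_SIZE
```

A real design could just as well interleave banks at a finer granularity to spread traffic; the contiguous map above merely makes the PE-to-bank routing decision concrete.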
D. Processing Element
Fig.6 MicroBlaze based Processing Element
To increase compatibility and to ease reuse of the architecture, the OCP-IP standard is used for the connection of PEs to the OCN. Thanks to the OCP standard, any processor with an OCP interface can easily be connected to our system. The MicroBlaze-based computing system is integrated as a PE in our FPGA design. The Xilinx MicroBlaze V7.00 soft core is a 32-bit reduced instruction set computer (RISC) optimized for implementation in Xilinx Field Programmable Gate Arrays (FPGAs); the MicroBlaze processor IP and its memory connection IPs are provided in the library of our FPGA design environment, the Xilinx Embedded Development Kit (EDK).
The MicroBlaze processor implements a Harvard memory architecture: instruction and data accesses are done in separate address spaces. It is also highly configurable: a set of parameters can be set at design time to fit the design requirements, such as the number of pipeline stages, cache sizes, interfaces, and execution units, including a selectable Barrel Shifter (BS), Floating Point Unit (FPU), hardware divider (HWD), hardware multiplier (HWM), and Memory Management Unit (MMU). The performance and the maximum execution frequency vary depending on the processor configuration. For its communication, MicroBlaze v7.00 offers a Processor Local Bus (PLB) interface and up to 16 Fast Simplex Link (FSL) interfaces, each a point-to-point FIFO-based communication channel.
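The build-time options listed above can be pictured as a configuration record. The sketch below is purely illustrative: the field names are ours, not the actual Xilinx EDK parameter names, and the defaults are arbitrary.

```python
from dataclasses import dataclass

# Hypothetical configuration record mirroring the MicroBlaze design-time
# options named in the text; field names are illustrative, not the real
# EDK parameter names.
@dataclass
class MicroBlazeConfig:
    pipeline_stages: int = 5      # 3 or 5 stages on MicroBlaze v7.00
    icache_bytes: int = 0         # 0 disables the instruction cache
    dcache_bytes: int = 0         # 0 disables the data cache
    barrel_shifter: bool = False  # BS
    fpu: bool = False             # FPU
    hw_divider: bool = False      # HWD
    hw_multiplier: bool = True    # HWM
    mmu: bool = False             # MMU
    fsl_links: int = 1            # up to 16 FSL interfaces

    def validate(self) -> None:
        """Reject combinations the text rules out."""
        assert self.pipeline_stages in (3, 5)
        assert 0 <= self.fsl_links <= 16
```

Since performance and maximum frequency depend on this configuration, each PE variant would be validated and synthesized separately.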
As shown in Fig. 6, the MicroBlaze is connected to its Instruction-side Local Memory Bus (ILMB) controller and Data-side Local Memory Bus (DLMB) controller through the ILMB and DLMB respectively. The two memory controllers control 32 KByte of BRAM-based local on-chip memory. As an OCP interface is not provided by MicroBlaze, an OCP adapter, which translates the FSL interface to an OCP interface, is integrated in the PE subsystem for the connection to the OCN.
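The effect of this split is that each PE access is served either by the single-cycle LMB BRAM or forwarded over the FSL-to-OCP adapter onto the NoC. A minimal sketch of that routing decision, with assumed base addresses (the real map is fixed in the EDK platform description, not given in the text):

```python
LOCAL_MEM_BYTES = 32 * 1024  # 32 KByte of LMB BRAM per PE (from the text)

# Assumed for illustration only: local BRAM at 0, DDR2 window higher up.
LOCAL_BASE = 0x0000_0000
DDR2_BASE = 0x4000_0000

def route_access(addr: int) -> str:
    """Decide whether a PE load/store stays on the LMB or crosses the
    FSL-to-OCP adapter onto the NoC (sketch with an assumed memory map)."""
    if LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_MEM_BYTES:
        return "LMB"   # single-cycle local BRAM access
    return "NoC"       # forwarded through the OCP adapter to a DDR2 bank
```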
E. OCN
Our On-Chip Network connection system is developed with the NoCCompiler and the Danube library from Arteris. The OCN comprises a request portion and a response portion; request and response transactions are exchanged between Master NIs and Slave NIs. The OCN protocol, NTTP, is a three-layered approach comprising transaction, transport and physical layers. NTTP uses packet-based wormhole scheduling. As shown in Fig. 3, request packets are comprised of three different kinds of cells: a header cell, a necker cell and possibly one or more data cells. The header cell contains information on routing, payload size, packet type, and the packet target address. The necker cell provides detailed addressing information for the target; it is not needed in response packets. The transaction layer is compatible with the bus-based transaction protocols implemented in the NIs. An NI translates third-party protocols to NTTP at the boundary of the NoC. We used OCP-to-NTTP NIs to convert the OCP 2.2 protocol to the NTTP protocol. The OCP basic signals, burst extension signals and the "MFlag" signal are used, as listed in Table 2. The data width of MData and SData is 64 bits. An OCP initiator can optionally associate a pressure level with its requests to indicate service priorities at arbitration points. The pressure level is passed to the NoC via the "MFlag" input signal and maps to the "Pressure" field in the packet header cell, as well as the "press" signals in the physical link layer. Locked synchronization is supported in our OCN. An OCP initiator can use the ReadExclusive (ReadEX) command followed by a Write or WriteNonPost command to perform an atomic read-modify-write transaction. The NI sends a Lock request packet when it receives the ReadEX command. The Lock request locks the path from the OCP master to the OCP slave.
During the locked period, other masters cannot access the locked slave until the OCP master that requested the ReadEX sends a Write or WriteNonPost command to unlock the path.
As shown in Fig. 5, the OCN is a cascading multistage interconnection network (MIN), which contains 8 switches for requests as well as 8 switches for responses. Twenty-four OCP-to-NTTP NIs and four NTTP-to-OCP NIs sit at the boundary of the OCN. The OCP-to-NTTP NIs convert the OCP master interfaces to NTTP interfaces and connect the PEs to the first-stage switches. The first stage comprises three 8x4 switches, while the second stage contains four 3x1 switches. Each output port of a second-stage switch is connected to an NTTP-to-OCP NI, which in turn connects to a DDR2 memory controller.
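With this topology the route of a request is fully determined by the source PE and the destination bank: the PE index selects the first-stage switch, the bank selects both that switch's output port and the second-stage switch. A small sketch of that port arithmetic (our own formulation of the topology described above):

```python
# Request-path routing through the two-stage MIN: three 8x4 first-stage
# switches, four 3x1 second-stage switches, 24 PEs, 4 DDR2 banks.
FIRST_STAGE_IN = 8   # PEs per first-stage switch
NUM_BANKS = 4        # one second-stage switch (and Slave NI) per bank

def route(pe: int, bank: int):
    """Return (stage-1 switch, its output port, stage-2 switch, its
    input port) for a request from PE `pe` to DDR2 bank `bank`."""
    assert 0 <= pe < 24 and 0 <= bank < NUM_BANKS
    s1 = pe // FIRST_STAGE_IN  # which 8x4 switch the PE enters
    s1_out = bank              # one output per destination bank
    s2 = bank                  # each 3x1 switch feeds one Slave NI
    s2_in = s1                 # one input per first-stage switch
    return s1, s1_out, s2, s2_in
```

The response network mirrors this structure in the opposite direction, which is why the switch count doubles to 8.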
III. Bi-Synchronous FIFO in GALS architecture
To improve performance and reduce the power consumption of the system, a GALS approach based on bi-synchronization is adopted in our design. The GALS approach has been proposed to solve the timing closure problem in deep sub-micron processes by partitioning the SoC into isolated synchronous subsystems, each holding its own independent frequency. To tackle the communication between two different clock domains, a bi-synchronous FSL has been integrated between the MicroBlaze and the OCP adapter (shown in Fig. 10).
The FIFO-based bi-synchronous FSL turns the PE and the OCN into isolated synchronous islands with independent clock frequencies. Each bi-synchronous FIFO has two clock inputs: M_CLK and S_CLK. The master side of the FSL operates at the frequency of M_CLK, while the slave side runs at the frequency of S_CLK. In our FPGA design, Digital Clock Managers (DCMs) generate a different frequency for each clock island.
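A common way such a bi-synchronous FIFO crosses the two clock domains is to exchange Gray-coded read/write pointers, so at most one bit changes per clock and a mis-sampled pointer is still a valid recent value. The text does not describe the FSL internals, so the following is a behavioral sketch of that standard technique, not the actual Bi-synchronous FSL RTL:

```python
# Behavioral model of a dual-clock FIFO. In hardware, each domain would
# sample the other domain's Gray-coded pointer through a 2-flop
# synchronizer; here the domains are simulated by separate push/pop calls.

def to_gray(n: int) -> int:
    """Binary-to-Gray conversion: adjacent values differ in one bit."""
    return n ^ (n >> 1)

class BiSyncFifo:
    def __init__(self, depth: int = 16):
        self.depth = depth
        self.mem = [None] * depth
        self.wptr = 0  # advanced in the M_CLK (master/writer) domain
        self.rptr = 0  # advanced in the S_CLK (slave/reader) domain

    def full(self) -> bool:
        return self.wptr - self.rptr == self.depth

    def empty(self) -> bool:
        # Gray-coded comparison, as the reader would see the write pointer.
        return to_gray(self.wptr) == to_gray(self.rptr)

    def push(self, word) -> bool:
        """Writer-domain operation; refuses the word when full."""
        if self.full():
            return False
        self.mem[self.wptr % self.depth] = word
        self.wptr += 1
        return True

    def pop(self):
        """Reader-domain operation; returns None when empty."""
        if self.empty():
            return None
        word = self.mem[self.rptr % self.depth]
        self.rptr += 1
        return word
```

Because each side only writes its own pointer and reads the other's, no word is ever handed across domains half-updated, which is exactly the isolation property the GALS islands rely on.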
IV. Evaluation and implementation
The described multiprocessor has been implemented on a Xilinx FX140 FPGA chip using Xilinx EDA tools.
Fig. 8 Floorplan of Multiprocessor
Resource    Used / Available    Utilization
RAMB16s     384 / 552           69%
DSP48s      72 / 192            37%
Slices      55266 / 63168       87%
Table 3 - FPGA resource utilization
Fig. 9 – FPGA Board
Fig. 7 Block diagram of OCP adapter
Signal      Driver   Function                    Group
MCmd        master   Transfer command            basic
MAddr       master   Transfer address            basic
MBurstLen   master   Burst length                burst
MData       master   Write data                  basic
MDataValid  master   Write data valid            basic
MDataLast   master   Last write data in burst    burst
MRespAcc    master   Accepts response            basic
MFlag       master   Flag for pressure level     press
SCmdAcc     slave    Accepts transfer            basic
SDataAcc    slave    Accepts write data          basic
SResp       slave    Transfer response           basic
SData       slave    Read data                   basic
SRespLast   slave    Last response in burst      burst
Table 2 - Signals for our OCP interface
V. Conclusions
We have described the design and implementation of a 24-processor multiprocessor with a network-on-chip interconnect on a single FPGA chip. External constraints such as the number of pins reduce the number of external DDR2 modules which can be connected and used for parallel processing.
References
1. ITRS, http://www.itrs.net
2. A. A. Jerraya et al., "Multiprocessor Systems-on-Chip", Morgan Kaufmann, 2004.
3. OCP-IP, Open Core Protocol Specification 2.2, http://www.ocpip.org/home, 2008.
4. Arteris S.A., http://www.arteris.com/
5. NoC Solution 1.12, NoC NTTP technical reference, o3446v8, April 2008.
6. Arteris Danube 1.12, Packet Transport Units technical reference, o4277v11, April 2008.
7. Alpha Data ADPe-XRC-4 FPGA card, http://www.alpha-data.com/adpe-xrc-4.html
8. Xilinx Virtex-4, http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/index.htm
9. Xilinx EDK 9.2, http://www.xilinx.com/ise/embedded/edk_docs.htm
10. Xilinx ISE 9.2, http://www.xilinx.com/ise/logic_design_prod/foundation.htm