


External DDR2-Constrained NOC-Based 24-Processors MPSOC Design and Implementation on Single FPGA

Zhoukun Wang, Omar Hammami
ENSTA ParisTech
32 Boulevard Victor, 75739 Paris, France
{zhoukun.wang, [email protected]}

Abstract – The Network on Chip (NoC) has been proposed as the interconnection substrate for multiprocessor Systems on Chip (SoC) because of the limited bandwidth of bus-based solutions. Although some designs are emerging, actual design experiences with NoC-based multiprocessor systems on chip remain scarce compared with simulation-based studies. However, implementation constraints clearly affect the design and modelling of a complex multiprocessor. In this paper we present the design and implementation of a 24-processor multiprocessor system under the constraint of limited access to four external DDR2 memory banks. All the processors and DDR2 memories are connected to a network on chip through Open Core Protocol (OCP) interfaces. The multiple clock domains resulting from the varying complexity of the IPs require a Globally Asynchronous Locally Synchronous (GALS) design methodology, which adds some extra area. The multiprocessor system is fully implemented on a Xilinx Virtex-4 FX140 FPGA based board and uses about 90% of the chip area.

I. Introduction

Multiprocessor Systems-on-Chip (MPSoCs) have become the standard for implementing embedded systems, yet conventional interconnects, such as buses and crossbars, cannot satisfy MPSoC requirements for performance, area, scalability and reliability. The On-Chip Network (OCN), or Network on Chip (NoC), has been proposed as a systematic approach to this communication-centric design challenge. The modular structure of a NoC makes a multiprocessor architecture highly scalable and improves the reliability and operating frequency of on-chip modules. Furthermore, the NoC approach offers matchless opportunities for implementing Globally Asynchronous, Locally Synchronous (GALS) designs, which make clock distribution and timing closure problems more manageable. To reduce time-to-market pressure and tackle the increasing complexity of SoCs, the need for fast prototyping and testing is growing. Taking advantage of deep submicron technology, modern FPGAs provide fast and low-cost prototyping with large logic resources and high performance. Designers can quickly implement their systems with the IP libraries provided by FPGA manufacturers. In this paper we present a multiprocessor system with twenty-four processing elements (PEs) and four DDR2 memory banks. All the PEs and memory controllers are connected by a scalable OCN developed with the Arteris Danube library. The rest of the paper is organized as follows. Section II describes our design and technology constraints, how they affect the architecture of the MPSOC to be designed, and the resulting overall MPSOC architecture with the details of the PE and OCN connections. Section III discusses the GALS clock strategy. Section IV presents the evaluation and implementation results. Lastly, we conclude.

II. Architecture

A. External Constraints

The multiprocessor to be designed has to target a single FPGA chip with constraints on the external pin connections. These pin constraints reduce the possible number of external DDR modules to four.

Fig. 1 MPSOC Target External Connections

These MPSOC external connections have to be implemented under the constraint of a flip-chip BGA package with a maximum of around 750 I/Os in the Xilinx Virtex-4 FX technology. In this family of chips the FF1517 package is the closest match.


Table 1 – Xilinx Virtex-4 Flip-chip packages

B. Overall Architecture

With the constraint of four external DDR2 banks, several MPSOC architectures are possible, assuming that we are using Xilinx MicroBlaze processors on the FX-140 chip. About 24 processors can be implemented, and they can be organized in the architectures described below.

Fig. 2 Block diagram of three MPSOC architectures: (a), (b) and (c)

Architecture (a) in Fig. 2 is organized as a classical dance-hall architecture, with the memory banks on one side of the NOC and the processors on the other. A NOC is essential in this case, as 24 processors are connected to the communication medium. Architecture (b) organizes the MPSOC around the DDR2 memory banks under a strong assumption of locality of references and computation, with 8 processors attached to each DDR2 bank. Accesses to each DDR2 bank go through a high-performance shared bus connected to the DDR2 for local accesses and to a NOC for distributed memory accesses. Finally, the last architecture (c) is based on a central NOC with distributed memory banks. It is clear that each architecture is sensitive to layout constraints. The Virtex-4 architecture adds an additional feature, described in the following figure.

Fig. 3 Virtex-4 Internal Architecture

The central column places the DCMs for the clock tree and the I/Os in the center of the chip. Any IP not accessing the I/Os should therefore be placed at the periphery of the layout, which is the case for the processor IPs, while IPs accessing the I/Os, such as the DDR2 controllers, should be placed in the center. This eliminates architectures (b) and (c).

Fig. 4 Possible MPSOC Layout

C. Selected Architecture

Fig. 5 MPSOC Architecture – details


The block diagram of the overall multiprocessor architecture is illustrated in Fig. 5. The multiprocessor system comprises 24 Processing Elements (PEs), which can independently run their own program code and operating system. These MicroBlaze-based PEs are connected to switches through OCP-to-NTTP Network Interface Units (NIs). The OCP-to-NTTP NI, also called the Master NI, translates OCP into our OCN protocol, the Arteris NoC Transaction and Transport Protocol (NTTP). The switching system is connected to four NTTP-to-OCP NIs (Slave NIs), each of which in turn connects to its respective DDR2 memory controller. Each DDR2 controller controls an off-chip DDR2 memory bank (256 MB).
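To make the shared-memory organization concrete, the sketch below shows a hypothetical flat address map in C. The 0x40000000 base address and the mapping macro are our own illustrative assumptions, not the address decoding of the actual design; only the 256 MB bank size matches the off-chip modules described above.

#include <stdint.h>

/* Hypothetical map of the four shared DDR2 banks (illustration only). */
#define DDR2_BANK_SIZE    0x10000000u                 /* 256 MB per bank */
#define DDR2_BANK_BASE(n) (0x40000000u + (uint32_t)(n) * DDR2_BANK_SIZE)

/* Any of the 24 PEs reaches bank n through its Master NI, e.g.: */
volatile uint32_t *bank2 = (volatile uint32_t *)DDR2_BANK_BASE(2);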

D. Processing Element

Fig. 6 MicroBlaze-based Processing Element

To increase compatibility and to ease reuse of the architecture, the OCP-IP standard is used for the connection between the PEs and the OCN. Thanks to the OCP standard, any processor with an OCP interface can easily be connected to our system. A MicroBlaze-based computing system is integrated as a PE in our FPGA design. The Xilinx soft-core MicroBlaze v7.00 is a 32-bit reduced instruction set computer (RISC) optimized for implementation in Xilinx FPGAs, and the MicroBlaze processor IP and its memory-connection IPs are provided in the library of our FPGA design environment, the Xilinx Embedded Development Kit (EDK).

The MicroBlaze processor implements a Harvard memory architecture: instruction and data accesses are done in separate address spaces. It is also highly configurable: a set of parameters can be set at design time to fit the design requirements, such as the number of pipeline stages, cache sizes, interfaces, and execution units like the optional Barrel Shifter (BS), Floating Point Unit (FPU), hardware divider (HWD), hardware multiplier (HWM) and Memory Management Unit (MMU). The performance and maximum operating frequency vary with the processor configuration. For its communication purposes, MicroBlaze v7.00 offers a Processor Local Bus (PLB) interface and up to 16 Fast Simplex Link (FSL) interfaces, each a point-to-point FIFO-based communication channel.

As shown in Fig. 6, the MicroBlaze is connected to its Instruction-side Local Memory Bus (ILMB) controller and Data-side Local Memory Bus (DLMB) controller through the ILMB and DLMB respectively. The two memory controllers manage 32 KB of BRAM-based local on-chip memory. As an OCP interface is not provided by MicroBlaze, an OCP adapter, which translates the FSL interface into an OCP interface, is integrated in the PE subsystem for the connection with the OCN.
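As an illustration of this path, the fragment below sketches how software on a PE might hand a request to the OCP adapter over FSL, using the putfsl/getfsl macros from the EDK fsl.h header. The FSL channel number and the two-word address/data convention are assumptions for illustration; the real adapter protocol is defined by the hardware design.

#include "fsl.h"   /* Xilinx EDK macros for FSL channel access */

/* Hypothetical convention: one address word and one data word out,
   one acknowledgement word back from the adapter. */
static inline void ocn_write_word(unsigned int addr, unsigned int data)
{
    unsigned int ack;
    putfsl(addr, 0);   /* blocking write on FSL channel 0 */
    putfsl(data, 0);
    getfsl(ack, 0);    /* blocking read of the adapter's reply */
    (void)ack;
}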

E. OCN

Our On-Chip Network is developed with the NoCCompiler tool and the Danube library from Arteris. The OCN comprises a request network and a response network, and request and response transactions are exchanged between Master NIs and Slave NIs. The OCN protocol, NTTP, is a three-layered approach comprising transaction, transport and physical layers. NTTP uses packet-based wormhole scheduling. Request packets are composed of three different cell types: a header cell, a necker cell and possibly one or more data cells. The header cell contains information on routing, payload size, packet type and the packet target address. The necker cell provides detailed addressing information for the target; it is not needed in response packets. The transaction layer is compatible with the bus-based transaction protocols implemented in the NIs. An NI translates third-party protocols to NTTP at the boundary of the NoC. We used OCP-to-NTTP NIs to convert the OCP 2.2 protocol to the NTTP protocol. The OCP basic signals, burst extension signals and the "MFlag" signal are used, as listed in Table 2. The data width of MData and SData is 64 bits. An OCP initiator can optionally associate a pressure level with its requests in order to indicate service priorities at arbitration points. The pressure level is passed to the NoC via the "MFlag" input signal and is applied to the "Pressure" field in the packet header cell, as well as the "press" signals at the physical link layer. Locked synchronization is supported in our OCN: an OCP initiator can use the ReadExclusive (ReadEX) command followed by a Write or WriteNonPost command to perform an atomic read-modify-write transaction. The NI sends a Lock request packet when it receives the ReadEX command. The Lock request locks the path from the OCP master to the OCP slave; during the locked period, other masters cannot access the locked slave until the OCP master that issued the ReadEX sends the Write or WriteNonPost command that unlocks the path.
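The cell structure described above can be summarized with the illustrative C view below. The exact bit-level layout of NTTP cells is Arteris-proprietary, so the field names and widths here only mirror the information the text attributes to each cell, not the real encoding.

#include <stdint.h>

typedef struct {
    uint32_t route;        /* routing information toward the target   */
    uint16_t payload_size; /* number of data cells that follow        */
    uint8_t  type;         /* packet type, e.g. read, write or lock   */
    uint8_t  pressure;     /* priority level taken from the OCP MFlag */
} nttp_header_cell;

typedef struct {
    uint64_t address;      /* detailed target addressing information  */
} nttp_necker_cell;        /* present in requests, not in responses   */

typedef struct {
    nttp_header_cell header;
    nttp_necker_cell necker;
    uint64_t         data[]; /* zero or more 64-bit data cells        */
} nttp_request_packet;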

As shown in Fig. 5, the OCN is a cascaded multistage interconnection network (MIN) containing 8 switches for the request network as well as 8 switches for the response network. Twenty-four OCP-to-NTTP NIs and four NTTP-to-OCP NIs are integrated at the boundary of the OCN. The OCP-to-NTTP NIs convert the OCP master interfaces to NTTP and connect the PEs to the first-stage switches. The first stage comprises three 8x4 switches, while the second stage contains four 3x1 switches. Each output port of a second-stage switch is connected to an NTTP-to-OCP NI, which in turn connects to a DDR2 memory controller.
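The request-path mapping implied by this topology can be written down as a small sketch: PE i enters first-stage switch i/8, the target bank selects one of that switch's four outputs, and each 3x1 second-stage switch merges the three first-stage links heading to the same bank. The function and field names below are ours, purely for illustration.

struct route {
    int stage1_switch;  /* which of the three 8x4 switches */
    int stage1_port;    /* which of its four outputs       */
    int stage2_switch;  /* which of the four 3x1 switches  */
};

static struct route request_route(int pe, int bank) /* pe: 0..23, bank: 0..3 */
{
    struct route r;
    r.stage1_switch = pe / 8; /* 24 PEs spread over three first-stage switches */
    r.stage1_port   = bank;   /* one output per DDR2 bank                      */
    r.stage2_switch = bank;   /* one 3x1 switch and Slave NI per bank          */
    return r;
}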

III. Bi-Synchronous FIFO in the GALS Architecture

To improve system performance and reduce power consumption, the GALS approach is adopted in our design through bi-synchronous FIFOs. The GALS approach has been proposed to solve the timing closure problem in deep sub-micron processes by partitioning the SoC into isolated synchronous subsystems, each with its own independent frequency. To tackle the communication issue between two different clock domains, a bi-synchronous FSL has been integrated between the MicroBlaze and the OCP adapter (see Fig. 7).

The FIFO-based bi-synchronous FSL makes the PE and the OCN isolated synchronous islands with independent clock frequencies. Each bi-synchronous FIFO has two clock inputs, M_CLK and S_CLK: the master side of the FSL operates at the frequency of M_CLK, while the slave side runs at the frequency of S_CLK. In our FPGA design, Digital Clock Managers (DCMs) generate a different frequency for each clock island.
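As a rough software model of the technique, the sketch below implements a dual-clock FIFO with gray-coded pointers, the classical way such bi-synchronous FIFOs detect full and empty safely across clock domains. This is our own illustrative C model, not the actual Xilinx FSL implementation; in hardware each gray pointer would cross into the other domain through a two-flip-flop synchronizer, which the model reduces to a plain copy.

#include <stdint.h>

#define DEPTH    8u                  /* must be a power of two           */
#define PTR_MASK (2u * DEPTH - 1u)   /* pointers carry one extra wrap bit */

typedef struct {
    uint32_t mem[DEPTH];
    uint32_t wptr;        /* owned by the M_CLK (producer) domain        */
    uint32_t rptr;        /* owned by the S_CLK (consumer) domain        */
    uint32_t wgray_sync;  /* gray-coded wptr as resynchronized in S_CLK  */
    uint32_t rgray_sync;  /* gray-coded rptr as resynchronized in M_CLK  */
} bisync_fifo;

static uint32_t gray(uint32_t b) { return b ^ (b >> 1); }

/* Producer side, clocked by M_CLK. Full when the write pointer is one
   full wrap ahead of the synchronized read pointer, i.e. the two gray
   codes differ only in their top two bits. */
static int fifo_push(bisync_fifo *f, uint32_t v)
{
    if (gray(f->wptr) == (f->rgray_sync ^ (DEPTH | (DEPTH >> 1))))
        return 0;                         /* full, push rejected */
    f->mem[f->wptr & (DEPTH - 1u)] = v;
    f->wptr = (f->wptr + 1u) & PTR_MASK;
    return 1;
}

/* Consumer side, clocked by S_CLK. Empty when the synchronized write
   pointer equals the read pointer. */
static int fifo_pop(bisync_fifo *f, uint32_t *v)
{
    if (gray(f->rptr) == f->wgray_sync)
        return 0;                         /* empty, nothing to pop */
    *v = f->mem[f->rptr & (DEPTH - 1u)];
    f->rptr = (f->rptr + 1u) & PTR_MASK;
    return 1;
}

/* Models the two-flip-flop resynchronization of each gray pointer into
   the opposite clock domain. */
static void sync_pointers(bisync_fifo *f)
{
    f->wgray_sync = gray(f->wptr);
    f->rgray_sync = gray(f->rptr);
}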

IV. Evaluation and Implementation

The described multiprocessor has been implemented on a Xilinx Virtex-4 FX140 FPGA chip using the Xilinx EDA tools.

Fig. 8 Floorplan of the multiprocessor

Resource    Used                   Utilization
RAMB16s     384 out of 552         69%
DSP48s      72 out of 192          37%
Slices      55266 out of 63168     87%

Table 3 - FPGA resource utilization

Fig. 9 – FPGA Board

Fig. 7 Block diagram of the OCP adapter

Signal      Driver   Function                    Group
MCmd        master   Transfer command            basic
MAddr       master   Transfer address            basic
MBurstLen   master   Burst length                burst
MData       master   Write data                  basic
MDataValid  master   Write data valid            basic
MDataLast   master   Last write data in burst    burst
MRespAcc    master   Accepts response            basic
MFlag       master   Flag for pressure level     press
SCmdAcc     slave    Accepts transfer            basic
SDataAcc    slave    Accepts write data          basic
SResp       slave    Transfer response           basic
SData       slave    Read data                   basic
SRespLast   slave    Last response in burst      burst

Table 2 - Signals for our OCP interface

V. Conclusions

We have described the design and implementation of a 24-processor multiprocessor with a network-on-chip interconnect on a single FPGA chip. External constraints, such as the number of pins, reduce the number of external DDR2 modules that can be connected and used for parallel processing.

References
1. ITRS, http://www.itrs.net
2. A. A. Jerraya et al., "Multiprocessor Systems-on-Chip", Morgan Kaufmann, 2004.
3. OCP-IP, Open Core Protocol Specification 2.2, http://www.ocpip.org/home, 2008.
4. Arteris S.A., http://www.arteris.com/
5. NoC Solution 1.12, NoC NTTP technical reference, o3446v8, April 2008.
6. Arteris Danube 1.12, Packet Transport Units technical reference, o4277v11, April 2008.
7. Alpha Data ADPe-XRC-4 FPGA card, http://www.alpha-data.com/adpe-xrc-4.html
8. Xilinx Virtex-4, http://www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex4/index.htm
9. Xilinx EDK 9.2, http://www.xilinx.com/ise/embedded/edk_docs.htm
10. Xilinx ISE 9.2, http://www.xilinx.com/ise/logic_design_prod/foundation.htm