
Improving Reliability in NoCs by Application-Specific Mapping Combined with Adaptive Fault-Tolerant Method in the Links Anelise Kologeski, Caroline Concatto, Luigi Carro and Fernanda Lima Kastensmidt

UFRGS – Universidade Federal do Rio Grande do Sul (PGMICRO, PPGC) - Porto Alegre, Brazil {alkologeski, cconcatto, carro, fglima}@inf.ufrgs.br

Abstract - A strategy to handle multiple defects in the NoC links with almost no impact on the communication delay is presented. The fault-tolerant method can guarantee the functionality of the NoC with multiple defects in any link and with multiple faulty links. The proposed technique uses information from the test phase to map the application and to configure fault-tolerant features along the NoC links. Results from an application remapped in the NoC show that the communication delay is almost unaffected, with minimal impact and overhead when compared to a fault-free system. We also show that our proposal has a variable impact on performance, while a traditional fault-tolerant solution such as Hamming Code has a constant impact. Moreover, our proposal can save between 15% and 100% of the energy when compared to Hamming Code.

Keywords - Adaptive Routing; Data Splitting; Fault Tolerance; Links; Mapping; NoCs.

I. INTRODUCTION

The use of fault-tolerance structures in Networks-on-Chip (NoCs) is becoming mandatory in new generations of Multiple Processor Systems-on-Chip (MPSoCs), due to the fact that it is almost impossible to manufacture integrated circuits without any defect in nanometer technologies [1]. As a result, the use of fault-tolerant methods is crucial to allow circuits with some amount of defects to still reach the market, increasing the yield and the lifetime of a chip. A classical example comes from DRAM circuits, where defects are compensated by the use of spare rows and columns.

However, the number of expected defects in high-density circuits is increasing, and fault-tolerant techniques able to detect and correct multiple faults are very expensive in terms of area, power and performance. Aiming at reducing this overhead, fault-tolerance features should be turned on only at the exact location of the defects. In this way, fault-tolerance structures do not penalize the circuit in power and/or performance when faults are not present.

Defects can have a permanent effect, such as stuck-at, short-circuit or open signals, or show intermittent effects, like crosstalk between interconnection lines. For the above-mentioned classes of defects, detection and diagnosis can be performed during manufacturing test, and off-line tests can also run during the lifetime of the circuit [2, 3, 4]. So, with the information about fault locations, a mechanism can deactivate a defective component and turn on a fault-tolerance feature. For example, in a homogeneous MPSoC, once a defect or failure has been detected in a microprocessor, the software application can be mapped on the remaining hardware components of the multiprocessor circuit [5]. Unfortunately, this simple deactivation approach cannot deal with a faulty router or a faulty link in the NoC, unless the NoC is modified to be able to adapt itself in the presence of faults.

In this context, we propose a method for fault tolerance in NoC links, able to cope with multiple defects including permanent and crosstalk faults. After the defects are located, the application can be remapped in order to minimize the impact on the application throughput. The proposed method uses an adaptive routing strategy combined with data splitting. This method does not require any extra wires in the links, and it causes minimal performance penalty and power consumption overhead on the NoC when compared to other popular fault-tolerant solutions. By using the test information it is possible to choose the best fault-tolerant configuration, taking into account also the best mapping for the fault location. This allows the MPSoC to deal with multiple faults per link and also with multiple faulty links, with minimal extra cost. Results confirm the advantage of combining the test results with adaptive fault-tolerance techniques and application mapping to achieve robustness and performance with minimal overhead and high connectivity.

Related solutions presented in the literature usually cope with single faults by using Hamming Code (HC) [4, 6, 7]. However, solutions based on error detection and correction codes (EDAC) require extra wires for parity bits and extra hardware for coding and decoding in each NoC link, with an obvious impact on latency. Moreover, they can deal with only one fault per link, not with multiple faults. Besides, extra hardware and wires embedded in the design consume extra power regardless of the defect distribution or crosstalk occurrence, thus compromising the overall power and energy dissipation. In contrast to this approach, our work also combines the proposed method with application mapping, showing that the impact on communication delay can be reduced considerably with a simple variation of the original task mapping.

This paper is organized as follows. Related works are discussed in Section II. The proposed fault-tolerant approach and the remapping proposal are presented in Section III. Synthesis, performance and connectivity results are presented in Section IV. Finally, conclusions are drawn in Section V.

II. RELATED WORKS

Recent works have proposed solutions to tolerate permanent and crosstalk faults in NoC links. The authors in [4] propose a technique that uses Hamming Code to protect all NoC links. Reported results show an area overhead of around 50% and a frequency penalty of 32% for a 180nm technology.

The technique proposed in [6] uses parity check, data splitting and data retransmission to protect the links against crosstalk and permanent faults in a 180nm technology. The authors propose the use of a parity check to detect a faulty link. In the presence of faults, the erroneous half of the data is duplicated and retransmitted. Due to the required retransmission, the performance penalty in [6] occurs only in the presence of a fault. The main disadvantage of this method is the use of extra wires and the area overhead compared to a router protected with HC only. The work in [6] is similar to [7], except that parity is replaced by Hamming. However, [7] leads to a final area four times larger than the unprotected router, and the latency is also increased by around four times. The techniques presented in [4, 6, 7] can cope with multiple faulty links, but not with multiple faults per link. Parity check, HC, voltage scaling and channel duplication have been used in [15] to protect the links and to reduce costs. The scheme in [15] protects the NoC with an area overhead of 22% and 100% more links, although with voltage scaling it can save 46.6% in power consumption when compared with an unprotected NoC. However, only triple-error correction and quadruple-error detection are possible, while our proposal can cope with up to 50% of faulty wires per link.

The work in [16] proposes using partially faulty links when the traffic in the network is high. The idea is to distribute the traffic uniformly over the links. The link capacity can be set to 25%, 50%, 75% or 100%, according to the faults in the link. This proposal has a power consumption overhead between 5% and 8% and an area overhead between 15% and 21%. However, [16] considers that all faults are concentrated within the same group of wires, whereas faults can actually be distributed across the link.

Other related techniques to tolerate faults in NoC links rely on knowing the location of the faulty link from test. The works presented in [8] and [9] use adaptive routing to avoid faulty links and routers, which implies a relatively low latency overhead. However, they use virtual channels and/or memory tables to avoid deadlock in the network, which normally means area overhead and extra power consumption. In [3], the authors propose a partially adaptive routing strategy to cope with faulty links based on a minimal change in the XY path. Consequently, virtual channels and tables are not used, and the technique in [3] has a smaller area overhead. However, because the routing is only partially adaptive, it is not always possible to find an alternative fault-free path, especially in the presence of multiple faults. Results in [3] have shown that, by using only adaptive routing, 34% of faulty links remain unprotected. Moreover, these techniques [3, 8, 9] cannot cope with faulty links between core and router.

The works proposed in [13] and [14] combine mapping and adaptive routing to increase reliability in NoCs. Both works present a mapping strategy that concurrently takes into account the application core graph, the fault probability of the links and the routing. Their goal is to obtain the Pareto set of mapping configurations with customized routing functions that minimize the average latency and maximize the reliability of the application. Both proposals use the same routing algorithm (APSRA) and do not cover faults in the connections from core to router and from router to core. The difference between [13] and [14] is the mapping algorithm. In the presence of two faulty links, [14] can cope with 96% of the cases in a 4x3 mesh topology, whereas our proposal can handle 100% of the faulty cases. Moreover, we can solve the problem caused by faulty links between core and router, which is not possible in [13, 14], reducing their efficiency to 65%.

Since our work uses the information about the fault location together with the chosen fault-tolerant solution, we can change the application mapping and expect a better behavior, while [13, 14] perform several computations at design time to find the best mapping for a given fault probability.

III. THE FAULT-TOLERANT STRATEGY

We propose a technique that copes with single and multiple faults (permanent and crosstalk) in NoC links and with multiple faulty links, by using adaptive routing, data splitting (in those cases where an alternative path cannot be found by adaptive routing) and mapping. Our strategy is applied after the test phase, which is done off-line, hence the application has no performance penalty due to recovery time. The combination of these three approaches avoids the need for additional wires in the links, minimizes the additional hardware in the critical path and increases the throughput when compared with the original mapping with faults. Consequently, minimal impact is expected to make a NoC tolerate multiple defects in the links. In addition, in the presence of multiple faults it is possible to have 100% of connectivity in the best case by combining adaptive routing and data splitting.

The proposed technique has been implemented in the case-study NoC [11] with a 2D-torus topology. The parametric router architecture has been implemented in VHDL and has been named ARDS (Adaptive Routing with Data Splitting). It uses routing switches with up to five bidirectional ports (Local, North, South, West and East), each port with two unidirectional channel links. Each router is connected to four neighbor routers (North, South, East and West) and to a core or processing element (Local). The architecture uses wormhole switching and a deterministic, source-based XY routing algorithm. Each packet consists of a first flit containing the header with the destination address, followed by a set of payload flits and a tail flit.

In a 2D-torus topology of size m x n with the XY routing algorithm, a packet has two possible routes in each dimension: it may go k steps one way (positive) or m-k (or n-k) steps the other way (negative). However, a packet travels no more than m-1 or n-1 steps from source to destination when m or n are odd, or at most m or n steps when they are even. Note that this flexibility in the routing can be seen as a feature to avoid faulty links, as the sketch below illustrates.
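
As a minimal illustration of these two route options (our own Python sketch with invented names, not part of the NoC implementation), the hop counts along one torus dimension can be computed as follows:

```python
# A minimal sketch (not from the paper) of the two route options the torus
# offers along one dimension of width m: k hops one way, or m-k hops the
# other way through the wrap-around link.
def torus_hops(src: int, dst: int, m: int) -> tuple:
    """Return (hops in the positive direction, hops in the negative direction)."""
    k = (dst - src) % m
    return k, (m - k) % m

# Example: in a 4-wide dimension, going from column 0 to column 3 takes
# 3 hops in one direction but only 1 hop the other way around the torus.
print(torus_hops(0, 3, m=4))   # -> (3, 1)
```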

To refer to the links, we define RR_link as a link between two routers, and CR_link as a link between a core and a router. Faults may happen in any links of the network and are classified as shown below and in Figure 1:

1. Intralink faults: they happen when aggressors and victims are in the same link. They can occur in isolation in an RR_link and/or in a CR_link, and each intralink fault is not associated with other links.

2. Interlink faults: they happen when aggressors and victims are in different links. Thus, each interlink fault occurs between two CR_links, between two RR_links, or between an RR_link and a CR_link. Multiple defects can be any combination of intralink and interlink faults.

Figure 1. Mpeg4 mapped in a 3x4 2D-torus NoC with an example of faults among NoC links (intralink and interlink faults affecting RR_links and CR_links).


In order to cope with defects, the proposed fault-tolerant method first attempts to use alternative paths to avoid the faulty links, by using the adaptive routing proposed in [3]. For that, each router is configured with the test results about faulty links: a 10-bit register is added to each ARDS router and programmed through scan chains. Thus, each router knows whether one or more of its communication channels (L, N, S, W and E, for input and output) are faulty. The bits corresponding to faulty channels are set to '1', and the routing algorithm checks these bits before forwarding the header of a packet. If the intended output channel is marked as faulty, an alternative (possibly longer) path replaces the original one in the header, and the packet is routed normally through the fault-free path. A simplified view of this decision is sketched below.
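
The following sketch (ours, in Python; the actual ARDS logic is a VHDL circuit, and the names below are illustrative) shows the kind of decision the router makes with the fault register loaded at test time:

```python
# Simplified model of one routing step: consult the per-channel fault bits and,
# if the intended output channel is faulty, take the opposite direction of the
# same dimension through the torus wrap-around.
OPPOSITE = {"N": "S", "S": "N", "E": "W", "W": "E"}

def pick_output(desired: str, faulty_out: dict, dim_size: int, k: int):
    """Return (output channel, remaining hops) for one routing decision."""
    if not faulty_out.get(desired, False):
        return desired, k                  # fault-free: keep the original XY route
    alt = OPPOSITE[desired]                # re-route the other way in the same dimension
    return alt, dim_size - k               # longer path, but it avoids the faulty channel

# Example: the East output was marked faulty by the test phase; a packet that
# still needed 2 hops eastwards in a 4-column row goes 2 hops westwards instead.
faulty = {"L": False, "N": False, "S": False, "W": False, "E": True}
print(pick_output("E", faulty, dim_size=4, k=2))   # -> ('W', 2)
```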

Each ARDS router knows the NoC size and its own position, so it can calculate the new number of steps needed by the packet on the new path. As a consequence, the router dynamically changes the target address in the header when the original address would use a faulty link. Although each flit in the message is re-routed, the impact on the overall message is normally low. The area overhead is small, due to the absence of virtual channels or tables, being only 1% of the router area. Moreover, the power consumption overhead is insignificant, because on average the opposite path is not much longer than the original one, since only the direction changes.

The adaptive routing scheme can cope with faults affecting any channels between two routers (RR_links). However, there is a set of faults that this method by itself cannot handle: faults in the links between core and router (CR_links) and faults affecting both channels in the X direction (East and West) or in the Y direction (North and South) of a router. The latter case can generate livelock in the router. In these cases, we are obliged to use the faulty channel, but must somehow avoid the faulty wires to ensure correct communication. More detailed information about the adaptive routing and its limitations can be found in [3].

The use of Data Splitting (DS) is then an alternative solution for the cases left unsolved (core isolation and livelock) by the adaptive routing strategy. This solution divides the flit into two parts and sends the data in two cycles. Multiplexers are placed at the inputs and outputs of each channel to select the fault-free wires that must be used to transmit the data. Each multiplexer is configured by a register. Based on the test results, the fault-free wires are selected, and this information is configured through a scan chain that connects all configuration registers. The faulty wires are unused and are tied to ground or vdd. This solution works for both crosstalk and permanent faults. When there is no fault, the DS solution is not used and is bypassed.

When DS is used, the flit halves are joined again when they reach the target node, which reduces the performance impact. If there is more than one faulty link in a path using DS, the flits are divided in two when they reach the first faulty channel and, from that point on, remain divided until they reach the target node. The position of each bit is selected at the input of each faulty link, and at each output the bits are placed back in the first-half positions of the buffer. This allows the data bits of each flit to be distributed over the fault-free wires of each faulty link in the path, ensuring that the latency impact of data splitting is the same regardless of the number of faulty channels in the path. A behavioural sketch of this mechanism is shown below.
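
The behaviour can be summarized with the following sketch (a Python model of the DS idea under our assumptions about wire selection, not the VHDL block itself):

```python
# An 8-bit flit crosses a link with two faulty wires in two cycles, using only
# the wires reported fault-free by the test phase.
def split_and_send(flit, good_wires, link_width=8):
    half = link_width // 2
    assert len(good_wires) >= half, "DS needs at least half of the wires fault-free"
    cycles = []
    for part in (flit[:half], flit[half:]):
        wires = [0] * link_width                # faulty wires stay tied to gnd/vdd
        for bit, w in zip(part, good_wires):    # input muxes steer the bits to good wires
            wires[w] = bit
        cycles.append(wires)
    return cycles                               # two transfers instead of one

def receive_and_join(cycles, good_wires, link_width=8):
    half = link_width // 2
    # output muxes place the bits back into the first-half positions of the buffer
    return [c[w] for c in cycles for w in good_wires[:half]]

good_wires = [0, 1, 2, 3, 6, 7]                 # wires 4 and 5 found faulty at test
flit = [1, 0, 1, 1, 0, 0, 1, 0]
assert receive_and_join(split_and_send(flit, good_wires), good_wires) == flit
```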

In order to reduce the communication delay, we apply a simple variation of the original mapping configuration in the presence of faults. Our solution is based on mirroring the original mapping, keeping the best mapping obtained at design time and only switching the positions of the cores in the NoC. Figure 2 shows an example of this concept using the mapping of four tasks in a 2x2 mesh NoC. Figure 2(a) shows the best mapping, while Figures 2(b), 2(c) and 2(d) are variations of the original mapping obtained by mirroring the cores, which does not compromise the quality of the original solution. In Figure 2, every core has 3 extra arrangements. In practice, however, with different applications, some cores cannot have 3 extra arrangements. This happens when mirroring allows a core to exchange positions with only one other core, as for the cores connected to R2, R5, R8 and R11 in Figure 1. In a small number of cases there are also cores that cannot easily change position, such as the core in the middle of a 3x3 mesh NoC. The mirrored arrangements can be generated as in the sketch below.
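
The three extra arrangements can be derived mechanically, as in this sketch (the grid-of-core-names representation is our assumption):

```python
# Mirroring the design-time mapping preserves its communication structure and
# only changes which core sits on which router.
def mirrorings(grid):
    return {
        "vertical":   [row[::-1] for row in grid],        # mirror about the vertical axis
        "horizontal": grid[::-1],                         # mirror about the horizontal axis
        "both":       [row[::-1] for row in grid[::-1]],  # both mirrors combined
    }

original = [["A", "B"],
            ["C", "D"]]                                   # the 2x2 example of Figure 2(a)
print(mirrorings(original)["vertical"])                   # [['B', 'A'], ['D', 'C']], as in Figure 2(b)
```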

To find a substitute core to use a faulty link with minimal impact on performance, we observe the communication pattern of the original mapping and its variations. The substitute core is chosen according to equation (1):

∆delay = (#packets*packet_size)/injection_rate (1)

where #packets is the total number of packets sent through the network by the core using the faulty link, packet_size is the packet size in bits and injection_rate is the injection rate of the core. The mapping in which the core using the faulty link has the lowest ∆delay is chosen. For instance, in Figure 2 there is an intralink fault in N1_to_R1. In Figure 2(a) core A uses the faulty link, in Figure 2(b) core B uses it, and in Figures 2(c) and 2(d) cores C and D use it, respectively. If the cores have ∆delay equal to A = 0.15 us, B = 0.38 us, C = 0.45 us and D = 0.05 us, then the mapping of Figure 2(d) is chosen, since core D has the lowest ∆delay. This selection step is sketched below.
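
A short sketch of this selection step (using the example values from the text):

```python
# Equation (1): ∆delay of the core that would sit on the faulty link.
def delta_delay(num_packets, packet_size_bits, injection_rate):
    return (num_packets * packet_size_bits) / injection_rate

# ∆delay (us) of the core placed on the faulty link N1_to_R1 in each candidate
# arrangement; the arrangement with the smallest value is kept.
candidates = {"original (A)": 0.15, "vertical (B)": 0.38,
              "horizontal (C)": 0.45, "both (D)": 0.05}
print(min(candidates, key=candidates.get))   # -> 'both (D)', i.e. the mapping of Figure 2(d)
```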

In summary, the fault-tolerance strategy first avoids the faulty link by using adaptive routing; for the unsolved cases (core isolation and livelock), data splitting is used. For the cases where data splitting has a large impact on the communication delay (basically because the faulty link is on the critical path), the simple remapping algorithm explained above can be applied.

IV. EXPERIMENTAL RESULTS

A case-study 3x4 2D-torus NoC with ARDS routers, 12-bit channels (8 bits to transmit data, 2 bits for packet control and 2 bits for handshake) and FIFOs of 4 slots has been described in VHDL. In order to obtain the best mapping, we use a tool based on Simulated Annealing that provides the arrangement of the cores with the minimal communication cost in the network; a generic sketch of such a mapping step is shown after this paragraph. To evaluate our proposal, we use the Mpeg4 benchmark [10]. The best mapping found by the tool is presented in Figure 1, and we refer to it as the original mapping. We use a standard cell library in 90nm CMOS technology with the Synopsys Power Compiler tool.
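
The sketch below is a generic simulated-annealing mapper, not the authors' tool; the cost model (sum of rate times hop distance) and the example cores and rates are illustrative assumptions only:

```python
import math, random

def torus_dist(a, b, rows, cols):
    # hop distance between two tiles of a rows x cols torus
    (r1, c1), (r2, c2) = a, b
    dr, dc = abs(r1 - r2), abs(c1 - c2)
    return min(dr, rows - dr) + min(dc, cols - dc)

def cost(place, flows, rows, cols):
    # flows: {(source_core, target_core): rate in MB/s}
    return sum(rate * torus_dist(place[s], place[d], rows, cols)
               for (s, d), rate in flows.items())

def anneal(cores, flows, rows, cols, steps=20000, t0=10.0):
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    random.shuffle(tiles)
    place = dict(zip(cores, tiles))
    best, best_cost = dict(place), cost(place, flows, rows, cols)
    cur = best_cost
    for i in range(steps):
        t = max(t0 * (1 - i / steps), 1e-3)          # simple linear cooling schedule
        a, b = random.sample(cores, 2)               # candidate move: swap two cores
        place[a], place[b] = place[b], place[a]
        new = cost(place, flows, rows, cols)
        if new < cur or random.random() < math.exp((cur - new) / t):
            cur = new
            if cur < best_cost:
                best, best_cost = dict(place), cur
        else:
            place[a], place[b] = place[b], place[a]  # reject: undo the swap
    return best, best_cost

# Toy usage with invented cores and rates (MB/s):
cores = ["A", "B", "C", "D"]
flows = {("A", "B"): 60.5, ("B", "C"): 40.0, ("C", "D"): 0.5}
print(anneal(cores, flows, rows=2, cols=2))
```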

Results of area, frequency, power dissipation, communication delay, energy and connectivity are presented for ARDS in the following subsections. The proposal is compared with HC, since related solutions presented in the literature usually cope with single faults by using Hamming Code (HC) or compare their solutions against HC.


Figure 2. (a) For a regular grid NoC there are 3 possible arrangements derived from the original mapping: (b) vertical mirroring of the original mapping; (c) horizontal mirroring of the original mapping; (d) vertical and horizontal mirroring of the original mapping.

A. Synthesis Results

Table I presents the area, maximum frequency and power consumption at 500 MHz for the original router (without fault tolerance), the ARDS router with and without the DS block active, and the HC router. For the HC router, the channel is 4 bits wider than in our proposal.

The area overhead of the ARDS router is about 28%, while for HC it is 15% when compared to the original router. One must observe that the extra wires used by HC are not counted in this 15% of area overhead. A 3x4 2D-torus NoC has 72 links (48 RR_links and 24 CR_links). A network with 8 bits of data and 4 bits of control per link requires 864 wires, whereas HC requires 1152 wires (+33.3% of overhead in the links). For link sizes of 16 and 32 bits, HC has a wire overhead of 25% and 16%, respectively (these counts are checked in the sketch below). The ARDS router reaches a higher frequency than the HC router because the latter uses an XOR cascade to encode and decode the data at the output and input. On the other hand, the ARDS router uses only multiplexers at the input and output and, when the DS block is not active, only a 2:1 multiplexer is used to bypass the DS block. It is also important to remember that our proposal can cope with multiple interlink and intralink faults, whereas HC copes only with a single intralink fault and multiple interlink faults.
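
These wire counts can be verified with a short calculation (our sketch, assuming 4 control wires per link and the usual check-bit counts of a single-error-correcting Hamming code):

```python
def hamming_check_bits(data_bits):
    r = 1
    while 2 ** r < data_bits + r + 1:     # smallest r with 2^r >= d + r + 1
        r += 1
    return r                              # 8 -> 4, 16 -> 5, 32 -> 6

LINKS, CONTROL = 72, 4                    # 3x4 2D-torus: 48 RR_links + 24 CR_links
for data in (8, 16, 32):
    plain = LINKS * (data + CONTROL)
    hc = LINKS * (data + hamming_check_bits(data) + CONTROL)
    print(f"{data}-bit data: {plain} vs {hc} wires (+{100 * (hc - plain) / plain:.1f}%)")
# 8-bit: 864 vs 1152 (+33.3%); 16-bit: +25.0%; 32-bit: +16.7%
```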

Based on the results in Table I, ARDS without the DS block active has the same power consumption as the original router at 500 MHz. When all DS blocks in a router are active, they increase its power consumption by 45%, whereas HC increases the power consumption by 50% when compared to the original router. The switching activity of the XOR gates in HC consumes more power than the multiplexers that configure the wires in the DS block when there are faults. Our proposal has a variable power consumption because the DS block is active only when it is needed: for one faulty CR_link one router turns on its DS block, and for one faulty RR_link two routers turn on their DS blocks. Thus, for a 3x4 NoC with two routers having the DS block active, our proposal consumes 18.34 mW while HC consumes 25.44 mW, which means that Hamming increases the power consumption by 80% and our proposal by 20% when compared to a NoC without fault tolerance; the short calculation below reproduces the 18.34 mW and 25.44 mW figures.
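
The NoC-level figures quoted above follow directly from the per-router values of Table I (a small calculation of ours):

```python
P_ORIG, P_DS, P_HC = 1.42, 2.07, 2.12     # mW per router @ 500 MHz (Table I)
ROUTERS = 12                              # 3x4 NoC

def ards_noc_power(routers_with_ds_on):
    # only routers adjacent to a faulty link pay the DS cost
    return routers_with_ds_on * P_DS + (ROUTERS - routers_with_ds_on) * P_ORIG

print(round(ards_noc_power(2), 2))        # 18.34 mW: one faulty RR_link -> two routers with DS on
print(round(ROUTERS * P_HC, 2))           # 25.44 mW: HC is paid in every router, faults or not
```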

We have also estimated the contribution of the links to the power consumption of the NoC. Table II presents the power consumption at 500 MHz for wire lengths of 0.5mm and 1mm. To measure the power consumption in the wires, we performed HSPICE simulations using the distributed π-model [12]. Hamming needs 33.3% extra wires for a link size of 8 bits, and consequently the power consumption in the wires increases in the same proportion, as one can see in Table II for 0.5mm and 1mm.

TABLE I. COMPARISON OF SYNTHESIS RESULTS FOR ORIGINAL, ARDS AND HAMMING CODE ROUTERS.

NoC Router             | Area for logic gates (µm²) | Max. Frequency (MHz) | Power Consumption (mW @ 500 MHz)
Original               | 10,954                     | 885                  | 1.42
ARDS without DS active | 14,104 (+28%)              | 880 (-1%)            | 1.42
ARDS with DS active    | 14,104 (+28%)              | 588 (-33%)           | 2.07 (+45%)
HC                     | 12,614 (+15%)              | 510 (-43%)           | 2.12 (+50%)

TABLE II. ESTIMATED POWER CONSUMPTION RESULTS IN ALL WIRES OF A 3x4 2D-TORUS NoC.

3x4 NoC (power in mW @ 500 MHz)         | 0.5mm wires | 1mm wires
Original and ARDS solution (864 wires)  | 1.20        | 9.04
HC solution (1152 wires)                | 1.64        | 12.06

For a wire length of 0.5mm, the power consumption of the set of wires is almost the same as the power consumption of one original router; for 1mm, however, the power consumption of the set of wires is 6 and 9 times the power consumption of the original router for the networks with ARDS and HC routers, respectively.

B. Performance Results for Mpeg4 in a 3x4 2D-Torus NoC

To analyze performance, the Mpeg4 benchmark [10], whose communication graph is presented in Figure 3, is mapped onto the NoC and used as the case study. We used a mapping tool to distribute the cores in the 3x4 2D-torus NoC in order to obtain the best throughput in the network, as previously stated; it is the best core distribution for the worst-case communication pattern. A cycle-accurate simulation using ModelSim has been performed to evaluate the average communication delay of the network. The communication delay is measured at the maximum frequency of each implementation. Each packet is composed of 4 flits; cores with a rate greater than 0.5 MB/s send 50 data packets, otherwise one packet is sent. For a link size of 8 bits, we calculated the packet injection rate of each core by multiplying the operating frequency by the link size and dividing by the rate in MB/s of the core, as in equation (2):

pkt_inj_rate = max_freq (MHz) * link_size (Bytes) / rate (MB/s) (2)
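
A direct transcription of equation (2) follows; our reading is that the resulting value behaves as an injection interval, i.e. how many cycles separate consecutive injections for a core with the given rate:

```python
def pkt_inj_rate(max_freq_mhz, link_size_bytes, rate_mb_s):
    return max_freq_mhz * link_size_bytes / rate_mb_s

# Illustrative numbers: an 8-bit (1-byte) link at 885 MHz and a core producing
# 0.5 MB/s give a value of 1770 (one injection every 1770 cycles).
print(pkt_inj_rate(885, 1, 0.5))   # -> 1770.0
```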

The fault model considers sets of interlink and intralink faults. Several simulations have been run for different fault scenarios, and we report the average communication delay for each one. To present the performance results, we analyze the impact of each fault case in the 3x4 NoC as follows:

Case I. Original router without faulty links. The communication time is the best because the network runs at 885 MHz and no fault-tolerance technique is used.

Case II. ARDS router without faulty links. In this case adaptive routing and data splitting are not used, and the communication delay is very similar to that of the original router, since only one multiplexer is used to bypass the DS block.

Case III. ARDS router with intralink faults affecting RR_links, using only the adaptive routing strategy. We inserted 34 intralink faults (except in the torus wrap-around links), one in each RR_link, to evaluate the average communication delay in this case. As in Case II, the impact on the communication delay is almost negligible because the DS block is not used.


Figure 3. Mpeg4 benchmark [10] with rates in MB/s.

Case IV. ARDS router without faulty links running at 588 MHz. This case only shows the communication delay for comparison with the next cases: the DS block is active, but there is no fault.

Case V. ARDS router with intralink faults affecting CR_links. We inserted 24 intralink faults in CR_links to observe the average impact of data splitting. For this case, only the DS solution can solve the problem. Only the routers with faulty links turn on the DS block, in order to minimize power consumption, and the NoC runs at 588 MHz.

Case VI. ARDS router with interlink faults affecting an RR_link and a CR_link. We inserted 192 interlink faults between a CR_link and an RR_link, with 8 simulations for each CR_link combined with all RR_links in the neighborhood of each router. This case uses data splitting in the CR_links and adaptive routing in the RR_links, and its communication delay is higher than with HC when the cores are not distributed according to the faulty links. However, we show that our proposal of core remapping according to the fault location can reduce the delay significantly, as presented in the next subsection.

Case VII. HC is used in the links, just for comparing our proposal with a traditional technique. In this case only a single fault per link is acceptable; for multiple intralink faults this technique cannot solve the problem.

Table III shows the average communication delay at the maximum frequency for each case presented above. The communication delay is the average time spent to send all packets through the network. The network with the HC solution always has a communication delay higher than the original network without link protection. On the other hand, ARDS has a different impact for each fault case, and most cases have a lower impact than HC. For Cases V and VI, in which the average communication delay is larger than with HC, ARDS uses our remapping according to the fault location.

For the Mpeg4 benchmark shown in Figure 3, we obtain the remapping results presented in Table IV. For the cores VU, RAST, SRAM1 and SDRAM with a faulty CR_link, the best mapping to reduce the communication delay is the horizontal mirroring of the original mapping, as one can see in Table IV. The best mapping for the cores MED CPU and SRAM2 is the vertical and horizontal mirroring, and for RISC it is the vertical mirroring.

TABLE III. AVERAGE COMMUNICATION DELAY FOR MPEG4 BENCHMARK IN A 3x4 2D-TORUS NoC.

Fault Case                            | I    | II   | III  | IV   | V    | VI   | VII (HC)
Communication delay (us @ max. freq.) | 2.69 | 2.72 | 2.77 | 4.05 | 4.92 | 5.02 | 4.67

TABLE IV. THE BEST MAPPING ACCORDING TO THE FAULTY CR_LINK LOCATION.

Cores with faulty link (Original → New)          | New mapping of the Mpeg4 cores onto the 3x4 NoC (one router row per group)
Original mapping                                 | VU, SDRAM, RAST / MED CPU, UP SAMP, SRAM1 / IDCT, SRAM2, RISC / AU, BAB, ADSP
VU → AU; RAST → ADSP; SRAM1 → RISC; SDRAM → BAB  | AU, BAB, ADSP / IDCT, SRAM2, RISC / MED CPU, UP SAMP, SRAM1 / VU, SDRAM, RAST
MED CPU → RISC; SRAM2 → UP SAMP                  | ADSP, BAB, AU / RISC, SRAM2, IDCT / SRAM1, UP SAMP, MED CPU / RAST, SDRAM, VU
IDCT → RISC                                      | RAST, SDRAM, VU / SRAM1, UP SAMP, MED CPU / RISC, SRAM2, IDCT / ADSP, BAB, AU

When a fault affects the CR_link connecting the cores AU, ADSP, RISC, BAB or UP SAMP to their routers, no remapping is done. Thus, there are 5 cores that cannot change position, and three of them (AU, ADSP and RISC) do not affect the communication delay. Only BAB and UP SAMP cannot be remapped beneficially, because the cores that would replace them (SRAM2 and SDRAM) would increase the communication delay.

The communication delay for each replaced core can be compared in Figure 4 between the original mapping and the remapping. The constant line represents the communication delay with HC. Considering the results for the Mpeg4 benchmark with 12 cores, the communication delay can be reduced by our remapping for 7 cores. In Figure 4 one can see that our proposal improves the performance in 8 of the 12 cases, and it always wins in the average case.

The average communication delay for the critical cases (Case V and Case VI) is shown in Table V. Our remapping solution reduces the average communication delay by up to 11% with respect to the original mapping with faults, and by 5.4% when compared to HC. Even in the presence of interlink faults (Case VI), the solution keeps the communication delay lower than with HC.

Figure 5 shows the energy for the 3x4 NoC running the Mpeg4 benchmark in 12 fault cases (one in each CR_link). Our proposal can save around 35% of energy when compared with HC. Besides, with HC the energy is always constant regardless of the faulty link.

Figure 4. Communication delay for a faulty CR_link with the original mapping and after remapping.


TABLE V. COMPARISON OF AVERAGE COMMUNICATION DELAY FOR THE CRITICAL CASES.

Evaluated Cases for the 3x4 NoC with Mpeg4 Benchmark | Communication Delay (us)
Case V                                               | 4.92
Case VI                                              | 5.02
Case VII (HC)                                        | 4.67
Case V with remapping                                | 4.43
Case VI with remapping                               | 4.52

C. Connectivity

Figure 6 shows the connectivity of the NoC (total availability of the entire set of links) versus the number of faults, for some specific cases. We evaluate the connectivity for our case study with the Mpeg4 benchmark in a 3x4 and in an 8x8 2D-torus NoC. We consider the best and the worst cases: the best case is when the faults are distributed among the wires with fewer than 50% of faulty wires in each link, in which case ARDS always finds a solution; the worst case is when more than 50% of the wires in a link are faulty, so that the link cannot be used. Note that the probability of a given faulty wire belonging to a particular link is 1/72 for the 3x4 NoC and 1/384 for the 8x8 NoC. For 100 faulty wires distributed randomly over all links of the NoC, the 3x4 NoC keeps at least 70% of connectivity while the 8x8 NoC keeps at least 94% of connectivity, in the worst case.

V. CONCLUSION

This paper presented a NoC with adaptive fault tolerance in the links. The technique is applied after the test phase and can protect the NoC against permanent faults and crosstalk. We also presented a simple remapping for fault cases to reduce the communication time. We used a very simple adaptive routing combined with data splitting, the latter used when adaptive routing cannot solve the problem due to some combination of defects in the links.

Our proposal can protect the NoC against multiple faults per link, and also against multiple faulty links, with only 28% of area overhead. Our approach has a variable impact on performance and, in almost all cases, after the remapping the communication delay is lower than with Hamming Code. Besides, our proposal can save a considerable amount of energy when compared with Hamming Code, and can keep 100% of connectivity in the best cases of fault distribution.

Our proposal also does not need extra wires in the links, which mainly contributes to minimizing power and energy when compared to Hamming Code. Our remapping has been done considering only one faulty CR_link, because faulty RR_links have almost no impact when using re-routing. As future work, we intend to analyze our strategy with respect to yield and QoS.

REFERENCES

[1] S. Furber, "Living with Failure: Lessons from Nature?", Proceedings of the Eleventh IEEE European Test Symposium (ETS'06), pp. 4-8, 2006.
[2] C.-Y. Yang and C. A. Papachristou, "A Method for Detecting Interconnect DSM Defects in Systems on Chip", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 1, pp. 197-204, 2006.
[3] C. Concatto, P. Almeida, F. Kastensmidt, E. Cota, M. Lubaszewski and M. Hervé, "Improving Yield of Torus NoCs through Fault-Diagnosis-and-Repair of Interconnect Faults", 15th IEEE International On-Line Testing Symposium (IOLTS), pp. 61-66, 2009.
[4] A. Frantz, F. Kastensmidt, L. Carro and E. Cota, "Dependable Network-on-Chip Router Able to Simultaneously Tolerate Soft Errors and Crosstalk", Proceedings of the International Test Conference (ITC), vol. 1, pp. 1-9, 2006.
[5] Z. Zhang, A. Greiner and S. Taktak, "A Reconfigurable Routing Algorithm for a Fault-Tolerant 2D-Mesh Network-on-Chip", Proceedings of the 45th Annual Design Automation Conference, pp. 441-446, 2008.
[6] M. Braga, E. Cota, F. Kastensmidt and M. Lubaszewski, "Efficiently Using Data Splitting and Retransmission to Tolerate Faults in Networks-on-Chip Interconnects", IEEE International Symposium on Circuits and Systems (ISCAS), 2010.
[7] T. Lehtonen, P. Liljeberg and J. Plosila, "Online Reconfigurable Self-Timed Links for Fault Tolerant NoCs", VLSI Design, 2007.
[8] T. Schönwald, J. Zimmermann, O. Bringmann and W. Rosenstiel, "Fully Adaptive Fault-Tolerant Routing Algorithm for Network-on-Chip Architectures", 10th Euromicro Conference on Digital System Design: Architectures, Methods and Tools, pp. 527-534, 2007.
[9] M. Koibuchi, H. Matsutani, H. Amano and T. M. Pinkston, "A Lightweight Fault-Tolerant Mechanism for Network-on-Chip", Second ACM/IEEE International Symposium on Networks-on-Chip, pp. 13-22, 2008.
[10] D. Bertozzi et al., "NoC Synthesis Flow for Customized Domain Specific Multiprocessor Systems-on-Chip", IEEE Transactions on Parallel and Distributed Systems, pp. 113-129, 2005.
[11] C. Zeferino and A. Susin, "SoCIN: A Parametric and Scalable Network-on-Chip", 16th Symposium on Integrated Circuits and Systems Design, pp. 169-174, 2003.
[12] T. Sakurai, "Approximation of Wiring Delay in MOSFET LSI", IEEE Journal of Solid-State Circuits, vol. SC-18, no. 4, Aug. 1983.
[13] R. Tornero, V. Sterrantino, M. Palesi and J. M. Orduña, "A Multi-Objective Strategy for Concurrent Mapping and Routing in Networks on Chip", IEEE International Symposium on Parallel & Distributed Processing (IPDPS), pp. 1-8, 2009.
[14] A. Dutta Choudhury, G. Palermo, C. Silvano and V. Zaccaria, "Yield Enhancement by Robust Application-Specific Mapping on Network-on-Chips", Second International Workshop on Network-on-Chip Architectures (NoCArc'09), New York City, USA, pp. 37-42, 2009.
[15] A. Ganguly, P. P. Pande and B. Belzer, "Crosstalk-Aware Channel Coding Schemes for Energy Efficient and Reliable NoC Interconnects", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 11, pp. 1626-1639, Nov. 2009.
[16] M. Palesi, S. Kumar and V. Catania, "Leveraging Partially Faulty Links Usage for Enhancing Yield and Performance in Networks-on-Chip", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 29, no. 3, pp. 426-440, March 2010.

Figure 5. Energy dissipated in each simulation with a faulty CR_link for each core, after the remapping. The HC energy is always constant.

Figure 6. Network connectivity (%) vs. number of faults, for the best case (3x4 and 8x8 NoC), the worst case of the 3x4 NoC and the worst case of the 8x8 NoC.
