A Case for Globally Shared-Medium On-Chip Interconnect

Enhancing Effective Throughput for Transmission Line-Based Bus

Aaron Carpenter, Jianyun Hu, Jie Xu, Michael Huang, Hui Wu
University of Rochester

Motivation: e.g. 5x5 mesh

• Worst case: 4 + 4 = 8 hops

• Per hop = pipeline delay + queue delay

• Example: 5 + 10 = 15 clock cycles per hop

• Worst case: 15 × 8 = 120 clock cycles; at a 1 GHz clock, that is 120 ns

• Much slower than a DRAM access
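The worst-case arithmetic on this slide can be reproduced with a short Python sketch. The function and parameter names are illustrative only; the numbers are the ones quoted on the slide.

```python
def mesh_worst_case_latency_ns(width, height, pipeline_cycles, queue_cycles, clock_ghz):
    """Worst-case corner-to-corner latency of a width x height mesh, in ns."""
    hops = (width - 1) + (height - 1)            # 4 + 4 = 8 hops for a 5x5 mesh
    cycles_per_hop = pipeline_cycles + queue_cycles
    total_cycles = hops * cycles_per_hop
    return total_cycles / clock_ghz              # cycles divided by cycles-per-ns


latency = mesh_worst_case_latency_ns(5, 5, pipeline_cycles=5, queue_cycles=10, clock_ghz=1.0)
print(latency)  # 120.0
```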

Motivation

• Non-uniform cache access (NUCA) delays create problems.

• Significant existing research aimed to reduce unnecessary remote accesses by trying to map data closer to the threads that frequently access the data.

Motivation

• Transmission-line circuit technology allows data rates of ≥ 26 Gb/s, i.e., about 0.04 ns per bit.

• Latency across chip ~ 2 ns.

• The authors claim a significant power reduction, because no power is spent at intermediate routers (and queues).

Their Proposed Architecture

• Use transmission lines (TLs) to create a shared bus:
– Two-level network: the first level connects 2-4 nodes per hub.
– A shared bus connects all hubs.
– Within a hub, nodes can be connected via, e.g., a crossbar.
– Centralized arbitration controls bus access.

Layout

Serpentine routing through every hub

Arbitration

• When a message is to be transferred from node i to node j:
1. A setup step is performed to “wake up” transmitter i.
2. In the background, the arbiter passes the grant on to node j.
3. Time is needed to drain the signal (waiting until the last bit has been transmitted).
4. The arbiter can then process the next task.
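The four steps can be sketched as a simple timeline model. This is only an illustration under stated assumptions: requests are granted in FIFO order, the grant hand-off in step 2 is fully hidden in the background, and the setup time is a made-up value. The 0.04 ns bit time and 2 ns drain time reuse the figures quoted earlier in the deck; `Transfer` and `schedule` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Transfer:
    src: int
    dst: int
    bits: int


def schedule(transfers, bit_time_ns, setup_ns, drain_ns):
    """Return (src, dst, start_ns, end_ns) for transfers granted in FIFO order."""
    timeline = []
    now = 0.0
    for t in transfers:
        start = now
        # Step 1: wake up the transmitter. Step 2 (passing the grant) is
        # assumed to overlap with other work and adds no time in this model.
        send_done = start + setup_ns + t.bits * bit_time_ns
        # Step 3: wait for the last bit to drain off the line.
        end = send_done + drain_ns
        timeline.append((t.src, t.dst, start, end))
        # Step 4: the arbiter may now serve the next request.
        now = end
    return timeline


requests = [Transfer(0, 3, 64), Transfer(2, 1, 64)]
for entry in schedule(requests, bit_time_ns=0.04, setup_ns=0.5, drain_ns=2.0):
    print(entry)
```

In this model the drain time is paid once per transfer, which is why short packets use the bus poorly.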

Implementation problems

• Where to put the arbiter?

• How to account for the communication delay of getting requests from the nodes to the arbiter and grants back?

• What is the overhead of routing request/grant lines between the arbiter and the nodes?

Put the arbiter in the middle?

Outline of Remaining Talk

• Transmission Line
– Transmission line medium
– Transceiver circuitry

• Node structure

• Bus Architecture

• Arbitration

• Interface Circuit Design

Transmission Line

• Transmission line medium
– Microstrips: simple, isolation, each line can support a high data rate (> 20 Gb/s)
– Crosstalk from neighboring lines requires very large spacing
– Coplanar waveguides: use a grounded strip in between the signal lines
– Significant spacing between signal lines
– Coplanar strips: use the more noise-tolerant differential signaling on a pair of lines

Transceiver circuitry

• Digital systems

• Analog receivers: allow more attenuation and thus higher rates than digital systems

• Analog transmitters: can be used together with more sophisticated encoding schemes

In their design:

• Coplanar strips, as they utilize the space of the top metal layer more efficiently

• Basic differential transmitters and receivers

• A data rate of 26.4 Gb/s can be achieved for a pair of transmission lines with a total pitch (including spacing) of 45 μm

• Within 2.5 mm of space, this pitch allows 55 pairs to be laid out, for 1.45 Tb/s of total bandwidth
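The aggregate-bandwidth figure follows directly from the pitch and the per-pair data rate. A minimal sketch, with an illustrative function name and the numbers taken from the slide:

```python
def aggregate_bandwidth(space_mm, pitch_um, rate_gbps_per_pair):
    """Return (number of line pairs that fit, total bandwidth in Tb/s)."""
    pairs = int(space_mm * 1000 // pitch_um)   # whole pairs only: 2500 um // 45 um
    total_tbps = pairs * rate_gbps_per_pair / 1000.0
    return pairs, total_tbps


pairs, tbps = aggregate_bandwidth(2.5, 45, 26.4)
print(pairs, round(tbps, 3))  # 55 1.452
```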

Node structure

• The assumption is that a chip consists of tiles, each with a core, an L1 cache, and a slice of a globally shared L2 (last-level) cache.

• If an L1 miss occurs and the address maps to a remote node, the access results in a packet injected into the interconnect.

• Otherwise, the L1 miss is served by the local L2 bank.
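The local-versus-remote decision can be illustrated as follows. The slides do not say how addresses map to L2 slices, so this sketch assumes a simple cache-line-interleaved mapping with 64-byte lines; `LINE_BYTES`, `home_node`, and `serve_l1_miss` are hypothetical names.

```python
LINE_BYTES = 64  # assumed cache-line size, not given in the slides


def home_node(addr, num_nodes):
    """Assumed static mapping: consecutive cache lines go to consecutive nodes."""
    return (addr // LINE_BYTES) % num_nodes


def serve_l1_miss(addr, local_node, num_nodes):
    """Decide whether an L1 miss stays local or becomes a network packet."""
    home = home_node(addr, num_nodes)
    if home == local_node:
        return ("local_l2", home)        # served by the local L2 bank
    return ("inject_packet", home)       # packet injected toward the remote node


print(serve_l1_miss(0x1000, local_node=0, num_nodes=16))  # ('local_l2', 0)
print(serve_l1_miss(0x1040, local_node=0, num_nodes=16))  # ('inject_packet', 1)
```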

Node structure

• Clustering a small number of cores and L2 slices into a node

• The backbone network only makes a stop at every node

• An intra-node fabric connects the multiple L1 caches and the L2 cache banks in the node

Performance

• Clustering adds extra latency for accesses from an L1 cache to the nearest L2 bank (Figure 4-b, Core0 to L20)

• It makes accessing neighboring cache banks within the node (Figure 4-b, Core1 to L20) faster

• It reduces the number of hubs a long-distance packet needs to traverse

• The extra cost of a larger intra-node fabric offsets the savings due to a lower number of hubs in the inter-node fabric

Bus Architecture

• Each node uses a high-speed communication circuit to deliver packets

• The bus is merely a shared medium that allows point-to-point communication

Partitioning the bus

• To increase throughput, use a wide bus

• Have multiple buses for different packets

• Bundling: for better utilization of the bus bandwidth, send multiple packets for each bus arbitration
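Why bundling helps can be seen from a simple amortization model: the fixed overhead per arbitration is paid once per bundle rather than once per packet. The sketch below is illustrative only; the function name and the timing values are assumptions, and it ignores the latency cost of waiting to fill a bundle.

```python
def bus_efficiency(packets_per_grant, packet_ns, overhead_ns):
    """Fraction of bus time spent carrying data when each grant carries a bundle."""
    busy_ns = packets_per_grant * packet_ns
    return busy_ns / (busy_ns + overhead_ns)


for n in (1, 2, 4):
    print(n, round(bus_efficiency(n, packet_ns=2.0, overhead_ns=2.0), 2))
# 1 0.5
# 2 0.67
# 4 0.8
```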

Interface Circuit Design

• A transmitter, a receiver, a serializer (SER), a deserializer (DES), and a phase and data recovery circuit (PDR).

• The transmitter (Tx) and receiver (Rx) are both implemented in standard CMOS technology without any special RF devices such as inductors.

• At 26.4 Gb/s, synchronization between the received data and the local clock is needed

Increasing Effective Bus Throughput

• There are many ways to increase the throughput of the bus at the circuit or architecture level. The proposed techniques can be categorized into three groups:
1. Increasing raw link throughput.
2. Increasing the utilization efficiency.
3. Optimizing the use of buses.

Increasing raw link throughput

• The potential link throughput is high: the inherent channel bandwidth of the transmission line is quite high.

• There are many coding methods to increase the raw throughput.


First, we turn to 4-PAM, which doubles the data rate compared to OOK. The additional circuitry consists of a DAC at the transmitter and an ADC at the receiver. Because these elements increase energy and latency, we use 4-PAM only for the data packet bus to minimize the latency impact.
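The doubling comes from carrying two bits per symbol instead of one. The sketch below illustrates this with a Gray-coded level mapping, which is an assumption made here for illustration; the slides do not state which mapping the design uses.

```python
# Hypothetical Gray-coded mapping from 2-bit groups to four amplitude levels.
PAM4_LEVELS = {(0, 0): 0, (0, 1): 1, (1, 1): 2, (1, 0): 3}


def ook_symbols(bits):
    """On-off keying: one symbol per bit."""
    return list(bits)


def pam4_symbols(bits):
    """4-PAM: one four-level symbol per pair of bits."""
    if len(bits) % 2:
        raise ValueError("4-PAM needs an even number of bits")
    return [PAM4_LEVELS[(bits[i], bits[i + 1])] for i in range(0, len(bits), 2)]


bits = [1, 0, 0, 1, 1, 1, 0, 0]
print(len(ook_symbols(bits)), len(pam4_symbols(bits)))  # 8 4
print(pam4_symbols(bits))  # [3, 1, 2, 0]
```

At the same symbol rate, the 4-PAM stream needs half as many symbol times as OOK for the same payload.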


Then we use Frequency Division Multiplexing (FDM), which allows us to use higher frequency bands. The attenuation in these bands increases with frequency and can be high. When used for a global bus, the higher bands become lossy. The higher-frequency channels are therefore intended for shorter-distance communication rather than for long transmission lines.


The supporting circuitry includes a mixer on both the transmitter and the receiver side and a filter at the receiver end. It is challenging to estimate the power cost of this support circuitry, so we use a simplified analysis to estimate the minimum power cost of supporting frequency-division, multi-band transmission.

Increasing the Utilization Efficiency

• While the underlying global transmission lines support a high data rate, using them to shuttle short packets can cause under-utilization:
1. Long lines mean it takes a long time for a signal to drain from the transmission line.
2. Packets destined for near neighbors are a poor match for the global line structure.

• A number of techniques can address these issues, including partitioning, wave-based arbitration, and segmentation.

Partitioning

• It is straightforward to partition the same number of underlying links into more, narrower buses. Longer serialization reduces waste due to draining.

• The finer granularity of partitioning allows the load of the two types of buses to be balanced better.

• For example, we can partition the five 1-flit-wide buses into any combination of meta buses and data buses. In this paper, we use a fixed configuration that achieves the best average performance.
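One way to think about choosing a meta/data split is to enumerate the combinations and compare the load each bus would carry. The sketch below is not the authors' selection method (they report picking the configuration with the best average performance); it is a hypothetical load-balancing heuristic with made-up traffic numbers.

```python
def best_split(total_buses, meta_load, data_load):
    """Pick (meta_buses, data_buses) that minimizes the heavier per-bus load."""
    candidates = []
    for meta in range(1, total_buses):           # keep at least one bus of each type
        data = total_buses - meta
        worst = max(meta_load / meta, data_load / data)
        candidates.append((worst, meta, data))
    _, meta, data = min(candidates)
    return meta, data


# Illustrative loads in arbitrary units: data traffic is twice the meta traffic.
print(best_split(5, meta_load=3.0, data_load=6.0))  # (2, 3)
```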

Segmentation

• We can also improve the bus's spatial utilization in order to increase efficiency.

• This is achieved by dividing the transmission line into a few segments. If a node is communicating with another node within the same segment, it only needs to arbitrate for that segment.

• When communication crosses multiple segments, the transmitter needs to obtain permission for all of them. The segments then act as a single transmission line.
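Which segments a transfer must arbitrate for can be sketched as below. The sketch assumes that nodes are numbered in order along the serpentine line and that segments hold equal numbers of nodes; both are assumptions for illustration, and `segments_needed` is a hypothetical name.

```python
def segments_needed(src, dst, nodes_per_segment):
    """Return the list of segment indices a src-to-dst transfer must acquire."""
    first = min(src, dst) // nodes_per_segment
    last = max(src, dst) // nodes_per_segment
    return list(range(first, last + 1))


print(segments_needed(1, 2, nodes_per_segment=4))  # [0]
print(segments_needed(2, 9, nodes_per_segment=4))  # [0, 1, 2]
```

Two transfers whose segment lists do not overlap could proceed at the same time, which is where the spatial-utilization gain comes from.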

Segmentation

• The segments can be connected in two ways:
1. A pass gate: a passive, bi-directional connection. It adds a small amount of attenuation and signal distortion, but this is acceptable.
2. Two separate uni-directional amplifiers. The cost of this approach is the power consumed by the amplifiers. With these amplifiers, however, the source transmitter power can be lower, since a signal travels at most the length of one segment.

Optimization on the use of buses

• Invalidation acknowledgement omission: With a packet-switched network, protocols rely on explicit invalidation acknowledgements to signal completion. The explicit acknowledgement can be avoided if the interconnect offers a capability to infer delivery.

• Limited multicasting: Transmission lines can allow multicast operation. It is easy to support a small number of receivers operating at once, and the added attenuation is acceptable. Even though multicasting may not reduce traffic dramatically, it cuts latency and queuing delay.

Interaction between techniques

• These three groups of techniques target different sources of performance gain. But within each group, there is a varying degree of overlap.

• In general, implementing one technique reduces the potential of another. So when multiple techniques are applied, we see diminishing returns.

• Example: When we send a pulse train on the bus, we wait until it propagates beyond the ends before allowing another pulse train. Since the propagation delay is significant compared with the duration of the pulse train, the duty cycle is low. The techniques for increasing utilization efficiency all try to improve this same duty cycle, each in a different way.

Experimental Setup

• Transmission Line Links
– A total pitch of 45 μm and a line width of 10 μm
– The transmission lines are of a serpentine shape and measure about 7.5 cm in total length

• Traffic and Performance Analysis
– The L1 miss rate of these applications ranges up to 61 misses per thousand instructions (MPKI).
– Figure (a): Percentage of L2 accesses that are remote
– Figure (b): Speedup due to clustering. The left bar is for 1 core per node, the right bar is for 2 cores per node. The baseline in this case is a 16-core mesh.

Performance comparison with mesh

• On average, the TLL bus achieves a speedup over the mesh of 1.15x in the 16-node and 1.17x in the 8-node configuration.

• The TLL bus reduces network energy by about 26x relative to the mesh

The Impact of Bundling

• The turn-around time also wastes bus bandwidth and can be mitigated with bundling

• Too much bundling can be detrimental to performance as well

Scaling Up: performance compared with mesh

• We conduct a limited scalability test with a 64-core system organized into 2- or 4-core nodes (32 nodes, 2 cores each; and 16 nodes, 4 cores each)

• On average, the TLL bus performs 16% and 25% better than the mesh for the 32- and 16-node systems, respectively.

• The bus system achieves 67% and 72% of the idealized performance (using digital wire) for 32 and 16 nodes, respectively.

• In a 16-core, 8-node system, the bus can achieve 91% of the ideal performance.

Scaling Up: performance compared with idealized circuit

CONCLUSIONS

• Mainstream chip multiprocessors are unlikely to require an extreme amount of bandwidth for on-chip backbone communication

• Only a small number of nodes will be connected by the packet-based backbone interconnect, and the traffic on this fabric can be rather limited

• Experiments show that in a medium-scale 16-core system, this design achieves 91% of the performance of an idealized wire-based interconnect

• An important benefit of avoiding packet switching and relaying is the inherent energy efficiency of the communication system.
