
    Document Number: 320412-001US

    An Introduction to the Intel QuickPath Interconnect

January 2009


Legal Lines and Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Products which implement the Intel QuickPath Interconnect may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Any code names presented in this document are only for use by Intel to identify products, technologies, or services in development that have not been made commercially available to the public, i.e., announced, launched or shipped. They are not "commercial" names for products or services and are not intended to function as trademarks.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725 or by visiting Intel's website at http://www.intel.com.

Intel, Core, Pentium Pro, Xeon, Intel Interconnect BIST (Intel BIST), and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2009, Intel Corporation. All Rights Reserved.

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information.


    Contents

Executive Overview
Introduction
Paper Scope and Organization
Evolution of Processor Interface
Interconnect Overview
Interconnect Details
    Physical Layer
    Link Layer
    Routing Layer
    Transport Layer
    Protocol Layer
Performance
Reliability, Availability, and Serviceability
Processor Bus Applications
Summary


    Revision History

Document Number    Revision Number    Description        Date
320412             -001US             Initial Release    January 2009


    Executive Overview

Intel microprocessors continue their performance ascent through ongoing microarchitecture evolutions and multi-core proliferation. The processor interconnect has similarly evolved (see Figure 1), keeping pace with microprocessor needs through faster buses, quad-pumped buses, dual independent buses (DIB), dedicated high-speed interconnects (DHSI), and now the Intel QuickPath Interconnect.

The Intel QuickPath Interconnect is a high-speed, packetized, point-to-point interconnect used in Intel's next generation of microprocessors, first produced in the second half of 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture (see note 1). Compared with today's wide front-side buses, it offers much higher bandwidth with low latency. The Intel QuickPath Interconnect has an efficient architecture allowing more interconnect performance to be achieved in real systems. It has a snoop protocol optimized for low latency and high scalability, as well as packet and lane structures enabling quick completions of transactions. Reliability, availability, and serviceability (RAS) features are built into the architecture to meet the needs of even the most mission-critical servers. With this compelling mix of performance and features, it is evident that the Intel QuickPath Interconnect provides the foundation for future generations of Intel microprocessors and that various vendors are designing innovative products around this interconnect technology.

Figure 1. Performance and Bandwidth Evolution (relative throughput over time, 1985-2010: 64-bit data bus with Pentium, double-pumped bus, quad-pumped bus with Pentium 4, DIB, and DHSI with Intel Core)

    Introduction

For the purpose of this paper, we will start our discussion with the introduction of the Intel Pentium Pro processor in 1995. The Pentium Pro microprocessor was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. Since then, Intel has produced a steady flow of products that are designed to meet the rigorous needs of servers and workstations. In 2008, just over a dozen years after the Pentium Pro processor was introduced, Intel took another big step in enterprise computing with the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture.

1. Scalable shared memory or distributed shared-memory (DSM) architectures are terms from Hennessy & Patterson, Computer Architecture: A Quantitative Approach. It can also be referred to as NUMA (non-uniform memory access).


    Figure 2. Intel QuickPath Architecture

The processors based on next-generation, 45-nm Hi-k Intel Core microarchitecture also utilize a new system framework for Intel microprocessors called the Intel QuickPath Architecture (see Figure 2). This architecture generally includes memory controllers integrated into the microprocessors, which are connected together with a high-speed, point-to-point interconnect. The new Intel QuickPath Interconnect provides high bandwidth and low latency, which deliver the interconnect performance needed to unleash the new microarchitecture and deliver the Reliability, Availability, and Serviceability (RAS) features expected in enterprise applications. This new interconnect is one piece of a balanced platform approach to achieving superior performance. It is a key ingredient in keeping pace with the next generation of microprocessors.

    Paper Scope and Organization

The remainder of this paper describes features of the Intel QuickPath Interconnect. First, a short overview of the evolution of the processor interface, including the Intel QuickPath Interconnect, is provided. Then each of the Intel QuickPath Interconnect architectural layers is defined, an overview of the coherency protocol is given, board layout features are surveyed, and some initial performance expectations are provided.

The primary audience for this paper is the technology enthusiast as well as other individuals who want to understand the future direction of this interconnect technology and the benefits it offers. Although all efforts are made to ensure accuracy, the information provided in this paper is brief and should not be used to make design decisions. Please contact your Intel representative for technical design collateral, as appropriate.

    Evolution of Processor Interface

In older shared bus systems, as shown in Figure 3, all traffic is sent across a single shared bi-directional bus, also known as the front-side bus (FSB). These wide buses (64-bit for Intel Xeon processors and 128-bit for Intel Itanium processors) bring in multiple data bytes at a time. The challenge to this approach was the electrical constraints encountered with increasing the frequency of the wide source synchronous buses. To get around this, Intel evolved the bus through a series of technology improvements.

Initially, in the late 1990s, data was clocked in at 2X the bus clock, also called double-pumped. Today's Intel Xeon processor FSBs are quad-pumped, bringing the data in at 4X the bus clock. The top theoretical data rate on FSBs today is 1.6 GT/s.


Figure 3. Shared Front-side Bus, up until 2004 (up to 4.2 GB/s platform bandwidth)

To further increase the bandwidth of the front-side bus based platforms, the single shared bus approach evolved into dual independent buses (DIB), as depicted in Figure 4. DIB designs essentially doubled the available bandwidth. However, all snoop traffic had to be broadcast on both buses and, if left unchecked, would reduce effective bandwidth. To minimize this problem, snoop filters were employed in the chipset to cache snoop information, thereby significantly reducing bandwidth loading.

Figure 4. Dual Independent Buses, circa 2005 (up to 12.8 GB/s platform bandwidth, with a snoop filter in the chipset)

The DIB approach was extended to its logical conclusion with the introduction of dedicated high-speed interconnects (DHSI), as shown in Figure 5. DHSI-based platforms use four FSBs, one for each processor in the platform. Again, snoop filters were employed to achieve bandwidth scaling.

Figure 5. Dedicated High-speed Interconnects, 2007 (up to 34 GB/s platform bandwidth, with a snoop filter in the chipset)


With the production of processors based on next-generation, 45-nm Hi-k Intel Core microarchitecture, the Intel Xeon processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using Intel QuickPath Interconnects. This configuration is shown in Figure 6. With its narrow uni-directional links based on differential signaling, the Intel QuickPath Interconnect is able to achieve substantially higher signaling rates, thereby delivering the processor interconnect bandwidth necessary to meet the demands of future processor generations.

Figure 6. Intel QuickPath Interconnect

    Interconnect Overview

The Intel QuickPath Interconnect is a high-speed point-to-point interconnect. Though sometimes classified as a serial bus, it is more accurately considered a point-to-point link, as data is sent in parallel across multiple lanes and packets are broken into multiple parallel transfers. It is a contemporary design that uses some techniques similar to other point-to-point interconnects, such as PCI Express* and Fully-Buffered DIMMs. There are, of course, some notable differences between these approaches, which reflect the fact that these interconnects were designed for different applications. Some of these similarities and differences will be explored later in this paper.

Figure 7 shows a schematic of a processor with external Intel QuickPath Interconnects. The processor may have one or more cores. When multiple cores are present, they may share caches or have separate caches. The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel QuickPath Interconnect port (a port contains a pair of uni-directional links).

Figure 7. Block Diagram of Processor with Intel QuickPath Interconnects


The physical connectivity of each interconnect link is made up of twenty differential signal pairs plus a differential forwarded clock. Each port supports a link pair consisting of two uni-directional links to complete the connection between two components. This supports traffic in both directions simultaneously. To facilitate flexibility and longevity, the interconnect is defined as having five layers (see Figure 8): Physical, Link, Routing, Transport, and Protocol.

The Physical layer consists of the actual wires carrying the signals, as well as circuitry and logic to support ancillary features required in the transmission and receipt of the 1s and 0s. The unit of transfer at the Physical layer is 20 bits, which is called a Phit (for Physical unit).

The next layer up the stack is the Link layer, which is responsible for reliable transmission and flow control. The Link layer's unit of transfer is an 80-bit Flit (for Flow control unit).

The Routing layer provides the framework for directing packets through the fabric.

The Transport layer is an architecturally defined layer (not implemented in the initial products) providing advanced routing capability for reliable end-to-end transmission.

The Protocol layer is the high-level set of rules for exchanging packets of data between devices. A packet is comprised of an integral number of Flits.

The Intel QuickPath Interconnect includes a cache coherency protocol to keep the distributed memory and caching structures coherent during system operation. It supports both low-latency source snooping and a scalable home snoop behavior. The coherency protocol provides for direct cache-to-cache transfers for optimal latency.

Figure 8. Architectural Layers of the Intel QuickPath Interconnect

Within the Physical layer are several features (such as lane/polarity reversal, data recovery and deskew circuits, and waveform equalization) that ease the design of the high-speed link. Initial bit rates supported are 4.8 GT/s and 6.4 GT/s. With the ability to transfer 16 bits of data payload per forwarded clock edge, this translates to 19.2 GB/s and 25.6 GB/s of theoretical peak data bandwidth per link pair.

At these high bit rates, RAS requirements are met through advanced features which include CRC error detection, link-level retry for error recovery, hot-plug support, clock fail-over, and link self-healing. With this combination of features, performance, and modularity, the Intel QuickPath Interconnect provides a high-bandwidth, low-latency interconnect solution capable of unleashing the performance potential of the next generation of Intel microarchitecture.


    Interconnect Details

The remainder of this paper describes the Intel QuickPath Interconnect in more detail. Each of the layers will be defined, an overview of the coherency protocol described, board layout features surveyed, and some initial thoughts on performance provided.

    Physical Layer

    Figure 9. Physical Layer Diagram

The Physical layer consists of the actual wires carrying the signals, as well as the circuitry and logic required to provide all features related to the transmission and receipt of the information transferred across the link. A link pair consists of two uni-directional links that operate simultaneously. Each full link is comprised of twenty 1-bit lanes that use differential signaling and are DC coupled. The specification defines operation in full, half, and quarter widths. The operational width is identified during initialization, which can be initiated during operation as part of a RAS event. A Phit contains all the information transferred by the Physical layer on a single clock edge: at full width that is 20 bits, at half-width 10 bits, and at quarter-width 5 bits. A Flit (see the Link Layer section) is always 80 bits regardless of the link width, so the number of Phits needed to transmit a Flit will increase by a factor of two or four for half- and quarter-width links, respectively.
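As a quick illustration of the Phit/Flit arithmetic described above, the following sketch computes how many Phits carry one 80-bit Flit at each defined link width. The constants come from this paper; the function itself is purely illustrative.

```python
# Illustrative Phit/Flit arithmetic; widths and sizes as described in this paper.
FLIT_BITS = 80  # a Flit is always 80 bits, regardless of link width

# The Phit size equals the operational link width in bits.
LINK_WIDTHS = {"full": 20, "half": 10, "quarter": 5}

def phits_per_flit(width_name: str) -> int:
    """Return how many Phits are needed to transfer one Flit at a given width."""
    phit_bits = LINK_WIDTHS[width_name]
    return FLIT_BITS // phit_bits

for name in LINK_WIDTHS:
    print(f"{name:>7}-width: {phits_per_flit(name):2d} Phits per Flit")
# full-width: 4, half-width: 8, quarter-width: 16
```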

Each link also has one and only one forwarded clock; the clock lane is required for each direction. In all, there are a total of eighty-four (84) individual signals to make up a single Intel QuickPath Interconnect port. This is significantly fewer signals than the wider, 64-bit front-side bus, which has approximately 150 pins. All transactions are encapsulated across these links, including configuration and all interrupts. There are no side-band signals. Table 1 compares the two interconnect technologies.

Table 1. Processor Interconnect Comparison

Feature                        Intel Front-Side Bus (3)   Intel QuickPath Interconnect (3)
Topology                       Bus                        Link
Signaling Technology (1)       GTL+                       Diff.
Rx Data Sampling (2)           SrcSync                    FwdClk
Bus Width (bits)               64                         20
Max Data Transfer Width        64                         16
Requires Side-band Signals     Yes                        No
Total Number of Pins           150                        84
Clocks Per Bus                 1                          1
Bi-directional Bus             Yes                        No
Coupling                       DC                         DC
Requires 8/10-bit Encoding     No                         No

1. Diff. stands for differential signaling.
2. SrcSync stands for source synchronous data sampling; FwdClk means forwarded clock(s).
3. Source: Intel internal presentation, December 2007.

The link is operated at a double-data rate, meaning the data bit rate is twice the forwarded clock frequency. The forwarded clock is a separate signal, not an encoded clock as used in PCI Express* Gen 1 and Gen 2. The initial product implementations are targeting bit rates of 6.4 GT/s and 4.8 GT/s. Table 2 compares the bit rates of different interconnect technologies.


Table 2. Contemporary Interconnect Bit Rates

Technology                      Bit Rate (GT/s)
Intel front-side bus            1.6 (1)
Intel QuickPath Interconnect    6.4 / 4.8
Fully-Buffered DIMM             4.0 (2)
PCI Express* Gen1 (3)           2.5
PCI Express* Gen2 (3)           5.0
PCI Express* Gen3 (3)           8.0

1. Transfer rate available on Intel Xeon Processor-based Workstation Platform.
2. Transfer rate available on Intel Xeon Processor-based Workstation Platform.
3. Source: PCI Express* 3.0 Frequently Asked Questions, PCI-SIG, August 2007.

The receivers include data recovery circuitry which can allow up to several bit-time intervals of skew between data lanes and the forwarded clock. De-skew is performed on each lane independently. The amount of skew that can be tolerated is product specific, and design collateral should be consulted for specifics. On the transmitter side, waveform equalization is used to obtain proper electrical characteristics. Signal integrity simulation is required to define the tap settings for waveform equalization. Routing lengths are also product and frequency specific.

One implementation example would be that the routing length could vary from 14 to 24 inches with zero to two connectors, depending on whether the link runs at 6.4 GT/s or 4.8 GT/s. Again, consult the design collateral and your Intel technical representative for specifics. To ease routing constraints, both lane and polarity reversal are supported. Polarity reversal allows an individual lane's differential pair to be swapped. In addition to reversing a lane, an entire link can be reversed to ease routing, so lanes 0 to 19 can be connected to lanes 19 to 0 and the initialization logic will adjust for the proper data transfer. This aids board routing by avoiding cross-over of signals. Both types of reversal are discovered automatically by the receiver during initialization. The logic configures itself appropriately to accommodate the reversal without any need for external means, such as straps or configuration registers.
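The reversal options above amount to a fixed remapping that the receiver's initialization logic undoes on its own. A minimal sketch, assuming the simple numbering used in the text (lane i wired to lane 19 - i for a reversed link, and an optionally swapped differential pair per lane); the function names are illustrative, not from the specification:

```python
# Illustrative model of lane reversal and polarity reversal (not the spec's logic).
NUM_LANES = 20

def reversed_lane(logical_lane: int) -> int:
    """Board wired with the link reversed: logical lane i arrives on lane 19 - i."""
    return (NUM_LANES - 1) - logical_lane

def polarity_corrected(bit: int) -> int:
    """A swapped differential pair inverts the received bit; flip it back."""
    return bit ^ 1

# Example: a Phit driven on logical lanes 0..19 arrives on physical lanes 19..0;
# the receiver detects this during initialization and re-orders the lanes itself.
print([reversed_lane(lane) for lane in range(NUM_LANES)])
```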

Logic circuits in the Physical layer are responsible for the link reset, initialization, and training. Like other high-speed links, the Physical layer is designed to expect a low rate of bit errors due to random noise and jitter in the system. These errors are routinely detected and corrected through functions in the Link layer. The Physical layer also performs periodic retraining to keep the Bit Error Rate (BER) from exceeding the specified threshold. The Physical layer supports loop-back test modes using the Intel Interconnect Built-In Self Test (Intel IBIST). Intel IBIST tools provide a mechanism for testing the entire interconnect path at full operational speed without the need for external test equipment. These tools greatly assist in the validation efforts required to ensure proper platform design integrity.


    Link Layer

The Link layer has three main responsibilities: (1) it guarantees reliable data transfer between two Intel QuickPath Interconnect protocol or routing entities; (2) it is responsible for the flow control between two protocol agents; and (3) in abstracting the Physical layer, it provides services to the higher layers.

The smallest unit of measure at the Link layer is a Flit (flow control unit). Each Flit is 80 bits long, which on a full-width link translates to four Phits. The Link layer presents a set of higher-level services to the stack. These services include multiple message classes and multiple virtual networks, which together are used to prevent protocol deadlocks.

Message Classes

The Link layer supports up to fourteen (14) Protocol layer message classes, of which six are currently defined. The remaining eight message classes are reserved for future use. The message classes provide independent transmission channels (virtual channels) to the Protocol layer, thereby allowing sharing of the physical channel.

The message classes are shown in Table 3. The messages with the SNP, NDR, DRS, NCS, and NCB encodings are unordered. An unordered channel has no required relationship between the order in which messages are sent on that channel and the order in which they are received. The HOM message class does have some ordering requirements with which components are required to comply.

Table 3. Message Classes

Name                    Abbr   Ordering   Data
Snoop                   SNP    None       No
Home                    HOM    Required   No
Non-data Response       NDR    None       No
Data Response           DRS    None       Yes
Non-coherent Standard   NCS    None       No
Non-coherent Bypass     NCB    None       Yes

    Virtual Networks

Virtual networks provide the Link layer with an additional method for replicating each message class into independent virtual channels. Virtual networks facilitate the support of a variety of features, including reliable routing, support for complex network topologies, and a reduction in required buffering through adaptively buffered virtual networks.

The Link layer supports up to three virtual networks. Each message class is subdivided among the three virtual networks. There are up to two independently buffered virtual networks (VN0 and VN1) and one shared adaptively buffered virtual network (VNA). See Figure 10. The total number of virtual channels supported is the product of the virtual networks supported and the message classes supported. For the Intel QuickPath Interconnect Link layer this is a maximum of eighteen virtual channels (three SNP, three HOM, three NDR, three DRS, three NCS, and three NCB).
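The virtual-channel count above is simply the cross product of message classes and virtual networks. A small sketch, using the class abbreviations from Table 3 and the virtual networks described here, makes that explicit:

```python
# Virtual channels = message classes x virtual networks (illustrative enumeration).
from itertools import product

MESSAGE_CLASSES = ["SNP", "HOM", "NDR", "DRS", "NCS", "NCB"]  # six defined classes
VIRTUAL_NETWORKS = ["VN0", "VN1", "VNA"]  # two buffered networks plus adaptive VNA

virtual_channels = [f"{mc}/{vn}" for mc, vn in product(MESSAGE_CLASSES, VIRTUAL_NETWORKS)]
print(len(virtual_channels))  # 18
print(virtual_channels[:4])   # ['SNP/VN0', 'SNP/VN1', 'SNP/VNA', 'HOM/VN0']
```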


    Figure 10. Virtual Networks

Note: Only 4 of 6 message classes are shown.

    Credit/Debit Scheme

The Link layer uses a credit/debit scheme for flow control, as depicted in Figure 11. During initialization, a sender is given a set number of credits to send packets, or Flits, to a receiver. Whenever a packet or Flit is sent to the receiver, the sender decrements its credit counters by one credit, which can represent either a packet or a Flit depending on the type of virtual network being used. Whenever a buffer is freed at the receiver, a credit is returned to the sender for that buffer. When the sender's credits for a given channel have been exhausted, it stops sending on that channel. Each packet contains an embedded flow control stream. This flow control stream returns credits from a receiving Link layer entity to a sending Link layer entity. Credits are returned after the receiving Link layer has consumed the received information, freed the appropriate buffers, and is ready to receive more information into those buffers.
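A minimal sketch of the credit/debit scheme just described, with a per-channel credit counter on the sender side. The class and method names are illustrative and not taken from the specification:

```python
# Illustrative credit/debit flow control for one virtual channel.
class CreditChannel:
    def __init__(self, initial_credits: int):
        # Credits granted by the receiver during initialization.
        self.credits = initial_credits

    def can_send(self) -> bool:
        return self.credits > 0

    def send(self) -> None:
        """Consume one credit per Flit (or packet, depending on the virtual network)."""
        if not self.can_send():
            raise RuntimeError("no credits: sender must stall on this channel")
        self.credits -= 1

    def return_credit(self, count: int = 1) -> None:
        """Receiver freed buffers and returned credits in the embedded flow control stream."""
        self.credits += count

channel = CreditChannel(initial_credits=2)
channel.send(); channel.send()
print(channel.can_send())   # False -> sender stalls until a credit is returned
channel.return_credit()
print(channel.can_send())   # True
```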

    Figure 11. Credit/Debit Flow Control

    Reliable Transmission

CRC error checking and recovery procedures are provided by the Link layer to isolate the effects of routine bit errors (which occur on the physical interconnect) from having any higher-layer impact. The Link layer generates the CRC at the transmitter and checks it at the receiver. The Link layer requires the use of 8 bits of CRC within each 80-bit Flit. An optional rolling 16-bit CRC method is available for the most demanding RAS applications. The CRC protects the entire information flow across the link; all signals are covered. The receiving Link layer performs the CRC calculation on the incoming Flit and, if there is an error, initiates a link-level retry. If an error is detected, the Link layer automatically requests that the sender back up and retransmit the Flits that were not properly received. The retry process is initiated by a special-cycle Control Flit, which is sent back to the transmitter to start the backup and retry process. The Link layer initiates and controls this process; no software intervention is needed. The Link layer reports the retry event to the software stack for error tracking and predictive maintenance algorithms. Table 4 details some of the error detection and recovery capabilities provided by the Intel QuickPath Interconnect. These capabilities are a key element of meeting the RAS requirements for new platforms.
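To make the per-Flit protection concrete, here is a generic CRC-8 over the 72 non-CRC bits of a Flit. The polynomial (0x07) and the bit ordering are placeholders chosen for illustration only; the actual CRC polynomial and Flit layout are defined by the Intel QuickPath Interconnect specification, not by this paper:

```python
# Illustrative CRC-8 over a Flit's 72 payload bits (polynomial is NOT the spec's).
CRC8_POLY = 0x07  # placeholder generator polynomial

def crc8(data: bytes, poly: int = CRC8_POLY) -> int:
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

payload = bytes(9)                                     # 72 bits = 9 bytes of Flit content
sent_crc = crc8(payload)
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]   # single bit error on the wire
print(crc8(corrupted) == sent_crc)                     # False -> triggers a link-level retry
```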


Table 4. RAS: Error Detection and Recovery Features

Feature                        Intel QuickPath Interconnect
CRC checking                   Required
All signals covered            Yes
CRC type                       8 bits / 80 bits
CRC bits per 64-B packet       72
Cycle penalty (bit times) (1)  None
Impact of error                Recovered
Additional features            Rolling CRC

1. Lower is better. The cycle penalty increases latency and reduces bandwidth utilization.

In addition to error detection and recovery, the Link layer controls the use of some additional RAS features in the Physical layer, such as self-healing links and clock fail-over. Self-healing allows the interconnect to recover from multiple hard errors with no loss of data or software intervention. Unrecoverable soft errors will initiate a dynamic link width reduction cycle: the sender and receiver negotiate to connect through a reduced link width. The link automatically reduces its width to either half or quarter width, based on which quadrants (five lanes) of the link are good. As the data transmission is retried in the process, no data loss results. Software is notified of the events, but need not directly intervene or control the operation. In the case of the clock lane succumbing to errors, the clock is mapped to a pre-defined data lane and the link continues to operate in half-width mode. If the data lane that is serving as the fail-over is not functional, there is a second fail-over clock path. Both the self-healing and clock fail-over functionalities are direction independent, meaning one direction of the link could be in a RAS mode while the other direction continues operating at full capacity. In these RAS modes, there is no loss of interconnect functionality or error protection; the only impact is reduced bandwidth on the link that is operating in RAS mode.
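A simplified sketch of the width-reduction decision described above, assuming the 20-lane link is treated as four 5-lane quadrants and that any sufficiently large set of good quadrants can be used; the real selection rules are product specific and more constrained than this:

```python
# Simplified self-healing width selection (assumed rules, not the specification's).
def select_width(good_quadrants: set) -> str:
    """Pick the widest operating mode the surviving 5-lane quadrants can support."""
    if good_quadrants == {0, 1, 2, 3}:
        return "full (20 lanes)"
    if len(good_quadrants) >= 2:
        return "half (10 lanes)"
    if len(good_quadrants) == 1:
        return "quarter (5 lanes)"
    return "link down"

print(select_width({0, 1, 2, 3}))  # full (20 lanes)
print(select_width({0, 2, 3}))     # half (10 lanes)
print(select_width({1}))           # quarter (5 lanes)
```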

    Routing Layer

The Routing layer is used to determine the course that a packet will traverse across the available system interconnects. Routing tables are defined by firmware and describe the possible paths that a packet can follow. In small configurations, such as a two-socket platform, the routing options are limited and the routing tables quite simple. For larger systems, the routing table options are more complex, giving the flexibility of routing and rerouting traffic depending on how (1) devices are populated in the platform, (2) system resources are partitioned, and (3) RAS events result in mapping around a failing resource.
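As an illustration of what a firmware-defined routing table amounts to, here is a toy table for a hypothetical four-socket ring. The node names, topology, and port numbers are invented for the example and are not taken from any product:

```python
# Toy firmware-style routing table for a hypothetical 4-socket ring (P0-P3).
# Maps (current node, destination node) -> outgoing interconnect port.
from typing import Optional

ROUTING_TABLE = {
    ("P0", "P1"): "port0", ("P0", "P2"): "port0", ("P0", "P3"): "port1",
    ("P1", "P2"): "port0", ("P1", "P3"): "port0", ("P1", "P0"): "port1",
    ("P2", "P3"): "port0", ("P2", "P0"): "port0", ("P2", "P1"): "port1",
    ("P3", "P0"): "port0", ("P3", "P1"): "port0", ("P3", "P2"): "port1",
}

def next_port(node: str, destination: str) -> Optional[str]:
    """Return the port a packet should leave on, or None if it is already home."""
    if node == destination:
        return None
    return ROUTING_TABLE[(node, destination)]

print(next_port("P0", "P3"))  # port1
```

A RAS event or a repartitioning of the platform would simply be reflected as a new table handed down by firmware.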

    Transport Layer

The optional Transport layer provides end-to-end transmission reliability. This architecturally defined layer is not part of Intel's initial product implementations and is being considered for future products.


    Protocol Layer

In this layer, the packet is defined as the unit of transfer. The packet contents definition is standardized, with some flexibility allowed to meet differing market segment requirements. The packets are categorized into six different classes, as mentioned earlier in this paper: home, snoop, data response, non-data response, non-coherent standard, and non-coherent bypass. The requests and responses either affect the coherent system memory space or are used for non-coherent transactions (such as configuration, memory-mapped I/O, interrupts, and messages between agents).

The system's cache coherency across all distributed caches and integrated memory controllers is maintained by all the distributed agents that participate in the coherent memory space transactions, subject to the rules defined by this layer. The Intel QuickPath Interconnect coherency protocol allows both home snoop and source snoop behaviors. Home snoop behavior is optimized for greater scalability, whereas source snoop is optimized for lower latency. The latter is used primarily in smaller-scale systems, where the smaller number of agents creates a relatively low amount of snoop traffic. Larger systems with more snoop agents could develop a significant amount of snoop traffic and hence would benefit from a home snoop mode of operation. As part of the coherence scheme, the Intel QuickPath Interconnect implements the popular MESI protocol (see note 2) and, optionally, introduces a new F-state.

    MESIF

The Intel QuickPath Interconnect implements a modified format of the MESI coherence protocol. The standard MESI protocol maintains every cache line in one of four states: modified, exclusive, shared, or invalid. A new read-only forward (F) state has also been introduced to enable cache-to-cache clean line forwarding. Characteristics of these states are summarized in Table 5. Only one agent can have a line in the F-state at any given time; the other agents can have S-state copies. Even when a cache line has been forwarded in this state, the home agent still needs to respond with a completion to allow retirement of the resources tracking the transaction. However, cache-to-cache transfers offer a low-latency path for returning data other than that from the home agent's memory.

Table 5. Cache States

State          Clean/Dirty   May Write?   May Forward?   May Transition To?
M  Modified    Dirty         Yes          Yes            -
E  Exclusive   Clean         Yes          Yes            M, S, I, F
S  Shared      Clean         No           No             I
I  Invalid     -             No           No             -
F  Forward     Clean         No           Yes            S, I

2. The MESI protocol (pronounced "messy") is named after the four states of its cache lines: Modified, Exclusive, Shared, and Invalid. (In Search of Clusters, by Gregory Pfister.)
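The state properties in Table 5 can be captured in a few lines. This sketch merely transcribes the table into a data structure; it is not a coherence-protocol implementation:

```python
# MESIF state properties, transcribed from Table 5 (illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheState:
    name: str
    dirty: bool
    may_write: bool
    may_forward: bool

MESIF = {
    "M": CacheState("Modified",  dirty=True,  may_write=True,  may_forward=True),
    "E": CacheState("Exclusive", dirty=False, may_write=True,  may_forward=True),
    "S": CacheState("Shared",    dirty=False, may_write=False, may_forward=False),
    "I": CacheState("Invalid",   dirty=False, may_write=False, may_forward=False),
    "F": CacheState("Forward",   dirty=False, may_write=False, may_forward=True),
}

# Only one agent may hold a given line in F at a time; other sharers hold it in S.
print([code for code, props in MESIF.items() if props.may_forward])  # ['M', 'E', 'F']
```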


    Protocol Agents

The Intel QuickPath Interconnect coherency protocol consists of two distinct types of agents: caching agents and home agents. A microprocessor will typically have both types of agents and possibly multiple agents of each type.

A caching agent represents an entity which may initiate transactions into coherent memory, and which may retain copies in its own cache structure. The caching agent is defined by the messages it may sink and source according to the behaviors defined in the cache coherence protocol. A caching agent can also provide copies of the coherent memory contents to other caching agents.

A home agent represents an entity which services coherent transactions, including handshaking as necessary with caching agents. A home agent supervises a portion of the coherent memory. Home agent logic is not specifically the memory controller circuits for main memory, but rather the additional Intel QuickPath Interconnect logic which maintains the coherency for a given address space. It is responsible for managing the conflicts that might arise among the different caching agents. It provides the appropriate data and ownership responses as required by a given transaction's flow.

There are two basic types of snoop behaviors supported by the Intel QuickPath Interconnect specification. Which snooping style is implemented is a processor-architecture-specific optimization decision. To oversimplify, source snoop offers the lowest latency for small multiprocessor configurations, while home snooping offers the best performance in systems with a high number of agents. The next two sections illustrate these snooping behaviors.


    Home Snoop

The home snoop coherency behavior defines the home agent as responsible for the snooping of other caching agents. The basic flow for a message involves up to four steps (see Figure 12). This flow is sometimes referred to as a three-hop snoop because the data is delivered in step 3. To illustrate, using a simplified read request to an address managed by a remote home agent, the steps are:

1. The caching agent issues a request to the home agent that manages the memory in question.

2. The home agent uses its directory structure to target a snoop to the caching agent that may have a copy of the memory in question.

3. The caching agent responds back to the home agent with the status of the address. In this example, processor #3 has a copy of the line in the proper state, so the data is delivered directly to the requesting cache agent.

4. The home agent resolves any conflicts and, if necessary, returns the data to the original requesting cache agent (after first checking to see if data was delivered by another caching agent, which in this case it was), and completes the transaction.

The Intel QuickPath Interconnect home snoop behavior implementation typically includes a directory structure to target the snoop to the specific caching agents that may have a copy of the data. This has the effect of reducing the number of snoops and snoop responses that the home agent has to deal with on the interconnect fabric. This is very useful in systems that have a large number of agents, although it comes at the expense of latency and complexity. Therefore, home snoop is targeted at systems optimized for a large number of agents.
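A compact sketch of the three-hop flow just described, assuming a single requested line, a directory that already records P3 as a holder, and message passing reduced to log entries; all names are illustrative:

```python
# Illustrative three-hop home snoop flow (P1 requester, P4 home agent).
def home_snoop_read(requester: str, home: str, directory: dict) -> list:
    log = [f"1. {requester} -> {home}: read request"]
    for peer in directory.get("line", []):            # home agent consults its directory
        log.append(f"2. {home} -> {peer}: snoop")
        log.append(f"3. {peer} -> {home}: snoop response; {peer} -> {requester}: data")
    log.append(f"4. {home} -> {requester}: completion (conflicts resolved)")
    return log

for step in home_snoop_read("P1", "P4", {"line": ["P3"]}):
    print(step)
```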

Figure 12. Home Snoop Example

Labels: P1 is the requesting caching agent; P2 and P3 are peer caching agents; P4 is the home agent for the line.
Precondition: P3 has a copy of the line in either M, E, or F-state.

Step 1. P1 requests data which is managed by the P4 home agent. (Note that P3 has a copy of the line.)
Step 2. P4 (home agent) checks the directory and sends snoop requests only to P3.
Step 3. P3 responds to the snoop by indicating to P4 that it has sent the data to P1. P3 provides the data back to P1.
Step 4. P4 provides the completion of the transaction.


    Source Snoop

The source snoop coherency behavior streamlines the completion of a transaction by allowing the source of the request to issue both the request and any required snoop messages. The basic flow for a message involves only three steps, sometimes referred to as a two-hop snoop since data can be delivered in step 2. Refer to Figure 13. Using the same read request discussed in the previous section, the steps are:

1. The caching agent issues a request to the home agent that manages the memory in question and issues snoops to all the other caching agents to see if they have copies of the memory in question.

2. The caching agents respond to the home agent with the status of the address. In this example, processor #3 has a copy of the line in the proper state, so the data is delivered directly to the requesting cache agent.

3. The home agent resolves any conflicts and completes the transaction.

The source snoop behavior saves a hop, thereby offering a lower latency. This comes at the expense of requiring agents to maintain a low-latency path to receive and respond to snoop requests; it also imparts additional bandwidth stress on the interconnect fabric, relative to the home snoop method. Therefore, the source snoop behavior is most effective in platforms with only a few agents.
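The same read under the two-hop source snoop behavior, again as an illustrative sketch rather than protocol code; note that the requester fans out the snoops itself and the data arrives in step 2:

```python
# Illustrative two-hop source snoop flow (P1 requester, P2/P3 peers, P4 home agent).
def source_snoop_read(requester: str, peers: list, home: str, holder: str) -> list:
    log = [f"1. {requester} -> {home}: read request; {requester} -> {', '.join(peers)}: snoops"]
    for peer in peers:
        if peer == holder:
            log.append(f"2. {peer} -> {home}: snoop response; {peer} -> {requester}: data")
        else:
            log.append(f"2. {peer} -> {home}: snoop response (no copy)")
    log.append(f"3. {home} -> {requester}: completion (conflicts resolved)")
    return log

for step in source_snoop_read("P1", ["P2", "P3"], "P4", holder="P3"):
    print(step)
```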

Figure 13. Source Snoop Example

Labels: P1 is the requesting caching agent; P2 and P3 are peer caching agents; P4 is the home agent for the line.
Precondition: P3 has a copy of the line in either M, E, or F-state.

Step 1. P1 requests data which is managed by the P4 home agent. (Note that P3 has a copy of the line.)
Step 2. P2 and P3 respond to the snoop request to P4 (home agent). P3 provides the data back to P1.
Step 3. P4 provides the completion of the transaction.


    Performance

Much is made about the relationship between the processor interconnect and overall processor performance. There is not always a direct correlation between interconnect bandwidth, latency, and processor performance. Instead, the interconnect must perform at a level that will not limit processor performance. Microprocessor performance has been continually improving through architectural enhancements, multiple cores, and increases in processor frequency. It is important that the performance of the interconnect remain sufficient to support the latent capabilities of a processor's architecture. However, providing more interconnect bandwidth or better latency does not by itself equate to an increase in performance. There are a few benchmarks which create a synthetic stress on the interconnect and can be used to see a more direct correlation between processor performance and interconnect performance, but they are not good proxies for real end-user performance. The Intel QuickPath Interconnect offers a high-bandwidth, low-latency interconnect solution. The raw bandwidth of the link is 2X the bandwidth of the previous Intel Xeon processor front-side bus. To the first order, it is easy to see the increase in bandwidth that the Intel QuickPath Interconnect offers.

    Raw Bandwidth

Raw and sustainable bandwidths are very different and have varying definitions. The raw bandwidth, or maximum theoretical bandwidth, is the rate at which data can be transferred across the connection without any accounting for the packet structure overhead or other effects. The simple calculation is the number of bytes that can be transferred per second. The Intel QuickPath Interconnect is a double-pumped data bus, meaning data is captured at the rate of one data transfer per edge of the forwarded clock, so every clock period captures two chunks of data. The maximum amount of data sent across a full-width Intel QuickPath Interconnect link is 16 bits, or 2 bytes. Note that 16 bits are used for this calculation, not 20 bits: although the link has up to 20 1-bit lanes, no more than 16 bits of real data payload are ever transmitted at a time, so the more accurate calculation uses 16 bits of data at a time. The maximum frequency of the initial Intel QuickPath Interconnect implementation is 3.2 GHz. This yields a double-pumped data rate of 6.4 GT/s with 2 bytes per transition, or 12.8 GB/s. An Intel QuickPath Interconnect link pair operates two uni-directional links simultaneously, which gives a final theoretical raw bandwidth of 25.6 GB/s. Using similar calculations, Table 6 shows the processor's bus bandwidth per port.

Table 6. Processor Bus Bandwidth Comparison, Per Port (1)

Bandwidth (higher is better)   Intel Front-Side Bus   Intel QuickPath Interconnect
Year                           2007                   2008
Rate (GT/s)                    1.6                    6.4
Width (bytes)                  8                      2
Bandwidth (GB/s)               12.8                   25.6
Coherency                      Yes                    Yes

1. Source: Intel internal presentation, December 2007.

    Packet Overhead

The maximum theoretical bandwidth is not sustainable in a real system, and sustainable bandwidth is implementation specific. A step closer to sustainable bandwidth is accounting for the packet overhead. Although still not a true measurable result, it is closer to real-life bandwidth. The typical Intel QuickPath Interconnect packet has a header Flit which requires four Phits to send across the link. The typical data transaction in microprocessors is a 64-byte cache line. The data payload, therefore, requires additional Flits beyond the header; a complete 64-byte data packet occupies nine Flits in total (consistent with the 72 CRC bits per 64-byte packet listed in Table 4), so the header and per-Flit CRC overhead reduce the sustainable bandwidth below the raw figures in Table 6.
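A sketch of that overhead accounting, assuming (as inferred above from Table 4) that a 64-byte transfer is carried as one header Flit plus eight data Flits; the exact packet formats are specification details not given in this paper:

```python
# Effective bandwidth after packet overhead (assumed 1 header Flit + 8 data Flits).
PHITS_PER_FLIT_FULL_WIDTH = 4
header_flits, data_flits = 1, 8        # assumed layout for a 64-byte cache line packet

total_flits = header_flits + data_flits
total_phits = total_flits * PHITS_PER_FLIT_FULL_WIDTH   # Phits on the wire per packet
link_gb_per_s = 12.8                                    # one direction at 6.4 GT/s
flit_efficiency = data_flits / total_flits              # fraction of Flits carrying line data
print(total_phits, round(flit_efficiency, 3), round(link_gb_per_s * flit_efficiency, 2))
# 36 0.889 11.38
```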


Reliability, Availability, and Serviceability

A key difference between desktop computers and servers is reliability, availability, and serviceability. Although all users want a reliable computer, the impact of a problem is typically much greater on a server running a core business application than on a personal workstation. Therefore, processors used in servers typically have more RAS features than their client brethren. For example, more expensive ECC memory is standard in servers, whereas clients use the less expensive non-ECC DRAM. Likewise, RAS features are important on a server processor interconnect, especially for the larger expandable and mission-critical servers found in IT data centers. The Intel QuickPath Interconnect was designed with these types of applications in mind, and its architecture includes several important RAS features. Product developers will optimize the overall system design by including the set of RAS features needed to meet the requirements of a particular platform. As such, the inclusion of some of these features is product specific.

One feature common across all Intel QuickPath Interconnect implementations is the use of CRC. The interconnect transmits 8 bits of CRC with every Flit, providing error detection without a latency performance penalty. Other protocols use additional cycles to send the CRC, thereby impacting performance. Many of these RAS features were identified previously in the Link Layer section of this document. They include retry mode and per-Flit CRC, as well as product-specific features such as rolling CRC, hot detect, link self-healing, and clock fail-over.

    Processor Bus Applications

In recent years, Intel has enabled innovative silicon designs that are closely coupled to the processor. For years, the processor bus (front-side bus) has been licensed to companies to develop chipsets that allow servers with eight, sixteen, thirty-two, or more processors. More recently, several companies have developed products that allow a re-programmable FPGA module to plug into an existing Intel Xeon processor socket. These innovative accelerator applications allow for significant speed-up of niche algorithms in areas like financial analytics, image processing, and oil and gas recovery. These innovative applications will continue on Intel's new processor platforms that use the Intel QuickPath Interconnect instead of the front-side bus. Unlike high-performance I/O and accelerator designs found on PCI Express*, these innovative products require the benefits of close coupling to the processor, such as ultra-low latency and cache coherency. The bulk of high-performance I/O and accelerator designs are found on the PCI Express* bus, which has an open license available through the PCI-SIG. This bus, however, has the limitation of being non-coherent. For applications requiring coherency, the Intel QuickPath Interconnect technology could be licensed from Intel. As new innovative accelerator applications are developed, they can take advantage of PCI Express* for the broadest set of applications and the Intel QuickPath Interconnect for the remaining cache-coherent niche.


    Summary

With its high-bandwidth, low-latency characteristics, the Intel QuickPath Interconnect advances the processor bus evolution, unlocking the potential of next-generation microprocessors. The 25.6 GB/s of low-latency bandwidth provides the basic fabric required for distributed shared memory architectures. The inline CRC codes provide more error coverage with less overhead than serial CRC approaches. Features such as lane reversal, polarity reversal, clock fail-over, and self-healing links ease the design of highly reliable and available products. The two-hop, source snoop behavior with cache line forwarding offers the shortest request completion in mainstream systems, while the home snoop behavior allows for optimizing highly scalable servers. With all these features and performance, it is no surprise that various vendors are designing innovative products around this interconnect technology and that it provides the foundation for the next generation of Intel microprocessors and many more to come.