
    Document Number: 320412-001US

    An Introduction to the Intel QuickPath Interconnect

January 2009


Legal Lines and Disclaimers

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER, AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Intel products are not intended for use in medical, life saving, life sustaining, critical control or safety systems, or in nuclear facility applications.

Intel may make changes to specifications and product descriptions at any time, without notice.

Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined." Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them.

Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families. See http://www.intel.com/products/processor_number for details.

Products which implement the Intel QuickPath Interconnect may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request.

Any code names presented in this document are only for use by Intel to identify products, technologies, or services in development that have not been made commercially available to the public, i.e., announced, launched or shipped. They are not "commercial" names for products or services and are not intended to function as trademarks.

Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order.

Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725 or by visiting Intel's website at http://www.intel.com.

Intel, Core, Pentium Pro, Xeon, Intel Interconnect BIST (Intel BIST), and the Intel logo are trademarks of Intel Corporation in the United States and other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2009, Intel Corporation. All Rights Reserved.

Notice: This document contains information on products in the design phase of development. The information here is subject to change without notice. Do not finalize a design with this information.


    Contents

Executive Overview
Introduction
Paper Scope and Organization
Evolution of Processor Interface
Interconnect Overview
Interconnect Details
    Physical Layer
    Link Layer
    Routing Layer
    Transport Layer
    Protocol Layer
Performance
Reliability, Availability, and Serviceability
Processor Bus Applications
Summary


    Revision History

Document Number    Revision Number    Description        Date
320412             -001US             Initial Release    January 2009


    Executive Overview

Intel microprocessors continue their performance ascent through ongoing microarchitecture evolutions and multi-core proliferation. The processor interconnect has similarly evolved (see Figure 1), keeping pace with microprocessor needs through faster buses, quad-pumped buses, dual independent buses (DIB), dedicated high-speed interconnects (DHSI), and now the Intel QuickPath Interconnect.

The Intel QuickPath Interconnect is a high-speed, packetized, point-to-point interconnect used in Intel's next generation of microprocessors, first produced in the second half of 2008. The narrow high-speed links stitch together processors in a distributed shared memory-style platform architecture (see note 1). Compared with today's wide front-side buses, it offers much higher bandwidth with low latency. The Intel QuickPath Interconnect has an efficient architecture allowing more interconnect performance to be achieved in real systems. It has a snoop protocol optimized for low latency and high scalability, as well as packet and lane structures enabling quick completions of transactions. Reliability, availability, and serviceability (RAS) features are built into the architecture to meet the needs of even the most mission-critical servers. With this compelling mix of performance and features, it is evident that the Intel QuickPath Interconnect provides the foundation for future generations of Intel microprocessors and that various vendors are designing innovative products around this interconnect technology.

Figure 1. Performance and Bandwidth Evolution (relative throughput over time, 1985-2010: 64-bit data bus with Pentium, double-pumped bus, quad-pumped bus with Pentium 4, DIB, and DHSI with Intel Core)

    Introduction

For the purpose of this paper, we will start our discussion with the introduction of the Intel Pentium Pro processor in 1995. The Pentium Pro microprocessor was the first Intel architecture microprocessor to support symmetric multiprocessing in various multiprocessor configurations. Since then, Intel has produced a steady flow of products that are designed to meet the rigorous needs of servers and workstations. In 2008, just over a dozen years after the Pentium Pro processor was introduced, Intel took another big step in enterprise computing with the production of processors based on the next-generation 45-nm Hi-k Intel Core microarchitecture.

1. Scalable shared memory or distributed shared-memory (DSM) architectures are terms from Hennessy & Patterson, Computer Architecture: A Quantitative Approach. It can also be referred to as NUMA (non-uniform memory access).


    Figure 2. Intel QuickPath Architecture

The processors based on next-generation, 45-nm Hi-k Intel Core microarchitecture also utilize a new system framework for Intel microprocessors called the Intel QuickPath Architecture (see Figure 2). This architecture generally includes memory controllers integrated into the microprocessors, which are connected together with a high-speed, point-to-point interconnect. The new Intel QuickPath Interconnect provides high bandwidth and low latency, which deliver the interconnect performance needed to unleash the new microarchitecture and deliver the Reliability, Availability, and Serviceability (RAS) features expected in enterprise applications. This new interconnect is one piece of a balanced platform approach to achieving superior performance. It is a key ingredient in keeping pace with the next generation of microprocessors.

    Paper Scope and Organization

The remainder of this paper describes features of the Intel QuickPath Interconnect. First, a short overview of the evolution of the processor interface, including the Intel QuickPath Interconnect, is provided. Then each of the Intel QuickPath Interconnect architectural layers is defined, an overview of the coherency protocol is given, board layout features are surveyed, and some initial performance expectations are provided.

The primary audience for this paper is the technology enthusiast as well as other individuals who want to understand the future direction of this interconnect technology and the benefits it offers. Although all efforts are made to ensure accuracy, the information provided in this paper is brief and should not be used to make design decisions. Please contact your Intel representative for technical design collateral, as appropriate.

    Evolution of Processor Interface

In older shared bus systems, as shown in Figure 3, all traffic is sent across a single shared bi-directional bus, also known as the front-side bus (FSB). These wide buses (64-bit for Intel Xeon processors and 128-bit for Intel Itanium processors) bring in multiple data bytes at a time. The challenge to this approach was the electrical constraints encountered with increasing the frequency of the wide source synchronous buses. To get around this, Intel evolved the bus through a series of technology improvements.

Initially, in the late 1990s, data was clocked in at 2X the bus clock, also called double-pumped. Today's Intel Xeon processor FSBs are quad-pumped, bringing the data in at 4X the bus clock. The top theoretical data rate on FSBs today is 1.6 GT/s.


Figure 3. Shared Front-side Bus, up until 2004 (up to 4.2 GB/s platform bandwidth)

To further increase the bandwidth of the front-side bus based platforms, the single shared bus approach evolved into dual independent buses (DIB), as depicted in Figure 4. DIB designs essentially doubled the available bandwidth. However, all snoop traffic had to be broadcast on both buses and, if left unchecked, would reduce effective bandwidth. To minimize this problem, snoop filters were employed in the chipset to cache snoop information, thereby significantly reducing bandwidth loading.

Figure 4. Dual Independent Buses, circa 2005 (up to 12.8 GB/s platform bandwidth, with a snoop filter in the chipset)

The DIB approach was extended to its logical conclusion with the introduction of dedicated high-speed interconnects (DHSI), as shown in Figure 5. DHSI-based platforms use four FSBs, one for each processor in the platform. Again, snoop filters were employed to achieve bandwidth scaling.

Figure 5. Dedicated High-speed Interconnects, 2007 (up to 34 GB/s platform bandwidth, with a snoop filter in the chipset)


With the production of processors based on next-generation, 45-nm Hi-k Intel Core microarchitecture, the Intel Xeon processor fabric will transition from a DHSI, with the memory controller in the chipset, to a distributed shared memory architecture using Intel QuickPath Interconnects. This configuration is shown in Figure 6. With its narrow uni-directional links based on differential signaling, the Intel QuickPath Interconnect is able to achieve substantially higher signaling rates, thereby delivering the processor interconnect bandwidth necessary to meet the demands of future processor generations.

Figure 6. Intel QuickPath Interconnect

    Interconnect Overview

The Intel QuickPath Interconnect is a high-speed point-to-point interconnect. Though sometimes classified as a serial bus, it is more accurately considered a point-to-point link, as data is sent in parallel across multiple lanes and packets are broken into multiple parallel transfers. It is a contemporary design that uses some techniques similar to other point-to-point interconnects, such as PCI Express* and Fully-Buffered DIMMs. There are, of course, some notable differences between these approaches, which reflect the fact that these interconnects were designed for different applications. Some of these similarities and differences will be explored later in this paper.

Figure 7 shows a schematic of a processor with external Intel QuickPath Interconnects. The processor may have one or more cores. When multiple cores are present, they may share caches or have separate caches. The processor also typically has one or more integrated memory controllers. Based on the level of scalability supported in the processor, it may include an integrated crossbar router and more than one Intel QuickPath Interconnect port (a port contains a pair of uni-directional links).

Figure 7. Block Diagram of Processor with Intel QuickPath Interconnects


The physical connectivity of each interconnect link is made up of twenty differential signal pairs plus a differential forwarded clock. Each port supports a link pair consisting of two uni-directional links to complete the connection between two components. This supports traffic in both directions simultaneously. To facilitate flexibility and longevity, the interconnect is defined as having five layers (see Figure 8): Physical, Link, Routing, Transport, and Protocol.

The Physical layer consists of the actual wires carrying the signals, as well as circuitry and logic to support ancillary features required in the transmission and receipt of the 1s and 0s. The unit of transfer at the Physical layer is 20 bits, which is called a Phit (for Physical unit).

The next layer up the stack is the Link layer, which is responsible for reliable transmission and flow control. The Link layer's unit of transfer is an 80-bit Flit (for Flow control unit).

The Routing layer provides the framework for directing packets through the fabric.

The Transport layer is an architecturally defined layer (not implemented in the initial products) providing advanced routing capability for reliable end-to-end transmission.

The Protocol layer is the high-level set of rules for exchanging packets of data between devices. A packet is comprised of an integral number of Flits.

The Intel QuickPath Interconnect includes a cache coherency protocol to keep the distributed memory and caching structures coherent during system operation. It supports both low-latency source snooping and a scalable home snoop behavior. The coherency protocol provides for direct cache-to-cache transfers for optimal latency.

Figure 8. Architectural Layers of the Intel QuickPath Interconnect

Within the Physical layer are several features (such as lane/polarity reversal, data recovery and deskew circuits, and waveform equalization) that ease the design of the high-speed link. Initial bit rates supported are 4.8 GT/s and 6.4 GT/s. With the ability to transfer 16 bits of data payload per forwarded clock edge, this translates to 19.2 GB/s and 25.6 GB/s of theoretical peak data bandwidth per link pair.

At these high bit rates, RAS requirements are met through advanced features which include CRC error detection, link-level retry for error recovery, hot-plug support, clock fail-over, and link self-healing. With this combination of features, performance, and modularity, the Intel QuickPath Interconnect provides a high-bandwidth, low-latency interconnect solution capable of unleashing the performance potential of the next generation of Intel microarchitecture.


    Interconnect Details

The remainder of this paper describes the Intel QuickPath Interconnect in more detail. Each of the layers will be defined, an overview of the coherency protocol described, board layout features surveyed, and some initial thoughts on performance provided.

    Physical Layer

    Figure 9. Physical Layer Diagram

The Physical layer consists of the actual wires carrying the signals, as well as the circuitry and logic required to provide all features related to the transmission and receipt of the information transferred across the link. A link pair consists of two uni-directional links that operate simultaneously. Each full link is comprised of twenty 1-bit lanes that use differential signaling and are DC coupled. The specification defines operation in full, half, and quarter widths. The operational width is identified during initialization, which can be initiated during operation as part of a RAS event. A Phit contains all the information transferred by the Physical layer on a single clock edge: at full width that is 20 bits, at half-width 10 bits, and at quarter-width 5 bits. A Flit (see the Link Layer section) is always 80 bits regardless of the link width, so the number of Phits needed to transmit a Flit will increase by a factor of two or four for half- and quarter-width links, respectively.
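As a quick illustration of the Phit/Flit arithmetic described above, the following sketch computes how many Phits carry one 80-bit Flit at each defined link width. The constants come from this paper; the function itself is purely illustrative.

```python
# Illustrative Phit/Flit arithmetic; widths and sizes as described in this paper.
FLIT_BITS = 80  # a Flit is always 80 bits, regardless of link width

# The Phit size equals the operational link width in bits.
LINK_WIDTHS = {"full": 20, "half": 10, "quarter": 5}

def phits_per_flit(width_name: str) -> int:
    """Return how many Phits are needed to transfer one Flit at a given width."""
    phit_bits = LINK_WIDTHS[width_name]
    return FLIT_BITS // phit_bits

for name in LINK_WIDTHS:
    print(f"{name:>7}-width: {phits_per_flit(name):2d} Phits per Flit")
# full-width: 4, half-width: 8, quarter-width: 16
```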

Each link also has one and only one forwarded clock; the clock lane is required for each direction. In all, there are a total of eighty-four (84) individual signals to make up a single Intel QuickPath Interconnect port. This is significantly fewer signals than the wider, 64-bit front-side bus, which has approximately 150 pins. All transactions are encapsulated across these links, including configuration and all interrupts. There are no side-band signals. Table 1 compares the two interconnect technologies.

Table 1. Processor Interconnect Comparison

Feature                        Intel Front-Side Bus (3)   Intel QuickPath Interconnect (3)
Topology                       Bus                        Link
Signaling Technology (1)       GTL+                       Diff.
Rx Data Sampling (2)           SrcSync                    FwdClk
Bus Width (bits)               64                         20
Max Data Transfer Width        64                         16
Requires Side-band Signals     Yes                        No
Total Number of Pins           150                        84
Clocks Per Bus                 1                          1
Bi-directional Bus             Yes                        No
Coupling                       DC                         DC
Requires 8/10-bit Encoding     No                         No

1. Diff. stands for differential signaling.
2. SrcSync stands for source synchronous data sampling; FwdClk means forwarded clock(s).
3. Source: Intel internal presentation, December 2007.

The link is operated at a double-data rate, meaning the data bit rate is twice the forwarded clock frequency. The forwarded clock is a separate signal, not an encoded clock as used in PCI Express* Gen 1 and Gen 2. The initial product implementations are targeting bit rates of 6.4 GT/s and 4.8 GT/s. Table 2 compares the bit rates of different interconnect technologies.


Table 2. Contemporary Interconnect Bit Rates

Technology                      Bit Rate (GT/s)
Intel front-side bus            1.6 (1)
Intel QuickPath Interconnect    6.4 / 4.8
Fully-Buffered DIMM             4.0 (2)
PCI Express* Gen1 (3)           2.5
PCI Express* Gen2 (3)           5.0
PCI Express* Gen3 (3)           8.0

1. Transfer rate available on Intel Xeon Processor-based Workstation Platform.
2. Transfer rate available on Intel Xeon Processor-based Workstation Platform.
3. Source: PCI Express* 3.0 Frequently Asked Questions, PCI-SIG, August 2007.

The receivers include data recovery circuitry which can allow up to several bit-time intervals of skew between data lanes and the forwarded clock. De-skew is performed on each lane independently. The amount of skew that can be tolerated is product specific, and design collateral should be consulted for specifics. On the transmitter side, waveform equalization is used to obtain proper electrical characteristics. Signal integrity simulation is required to define the tap settings for waveform equalization. Routing lengths are also product and frequency specific.

One implementation example would be that the routing length could vary from 14 to 24 inches with zero to two connectors, depending on whether the link runs at 6.4 GT/s or 4.8 GT/s. Again, consult the design collateral and your Intel technical representative for specifics. To ease routing constraints, both lane and polarity reversal are supported. Polarity reversal allows an individual lane's differential pair to be swapped. In addition to reversing a lane, an entire link can be reversed to ease routing, so lanes 0 to 19 can be connected to lanes 19 to 0 and the initialization logic will adjust for the proper data transfer. This aids board routing by avoiding cross-over of signals. Both types of reversal are discovered automatically by the receiver during initialization. The logic configures itself appropriately to accommodate the reversal without any need for external means, such as straps or configuration registers.
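The reversal options above amount to a fixed remapping that the receiver's initialization logic undoes on its own. A minimal sketch, assuming the simple numbering used in the text (lane i wired to lane 19 - i for a reversed link, and an optionally swapped differential pair per lane); the function names are illustrative, not from the specification:

```python
# Illustrative model of lane reversal and polarity reversal (not the spec's logic).
NUM_LANES = 20

def reversed_lane(logical_lane: int) -> int:
    """Board wired with the link reversed: logical lane i arrives on lane 19 - i."""
    return (NUM_LANES - 1) - logical_lane

def polarity_corrected(bit: int) -> int:
    """A swapped differential pair inverts the received bit; flip it back."""
    return bit ^ 1

# Example: a Phit driven on logical lanes 0..19 arrives on physical lanes 19..0;
# the receiver detects this during initialization and re-orders the lanes itself.
print([reversed_lane(lane) for lane in range(NUM_LANES)])
```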

Logic circuits in the Physical layer are responsible for the link reset, initialization, and training. Like other high-speed links, the Physical layer is designed to expect a low rate of bit errors due to random noise and jitter in the system. These errors are routinely detected and corrected through functions in the Link layer. The Physical layer also performs periodic retraining to keep the Bit Error Rate (BER) from exceeding the specified threshold. The Physical layer supports loop-back test modes using the Intel Interconnect Built-In Self Test (Intel IBIST). Intel IBIST tools provide a mechanism for testing the entire interconnect path at full operational speed without the need for external test equipment. These tools greatly assist in the validation efforts required to ensure proper platform design integrity.


    Link Layer

The Link layer has three main responsibilities: (1) it guarantees reliable data transfer between two Intel QuickPath Interconnect protocol or routing entities; (2) it is responsible for the flow control between two protocol agents; and (3) in abstracting the Physical layer, it provides services to the higher layers.

The smallest unit of measure at the Link layer is a Flit (flow control unit). Each Flit is 80 bits long, which on a full-width link translates to four Phits. The Link layer presents a set of higher-level services to the stack. These services include multiple message classes and multiple virtual networks, which together are used to prevent protocol deadlocks.

Message Classes

The Link layer supports up to fourteen (14) Protocol layer message classes, of which six are currently defined. The remaining eight message classes are reserved for future use. The message classes provide independent transmission channels (virtual channels) to the Protocol layer, thereby allowing sharing of the physical channel.

The message classes are shown in Table 3. The messages with the SNP, NDR, DRS, NCS, and NCB encodings are unordered. An unordered channel has no required relationship between the order in which messages are sent on that channel and the order in which they are received. The HOM message class does have some ordering requirements with which components are required to comply.

Table 3. Message Classes

Name                    Abbr   Ordering   Data
Snoop                   SNP    None       No
Home                    HOM    Required   No
Non-data Response       NDR    None       No
Data Response           DRS    None       Yes
Non-coherent Standard   NCS    None       No
Non-coherent Bypass     NCB    None       Yes

    Virtual Networks

Virtual networks provide the Link layer with an additional method for replicating each message class into independent virtual channels. Virtual networks facilitate the support of a variety of features, including reliable routing, support for complex network topologies, and a reduction in required buffering through adaptively buffered virtual networks.

The Link layer supports up to three virtual networks. Each message class is subdivided among the three virtual networks. There are up to two independently buffered virtual networks (VN0 and VN1) and one shared adaptively buffered virtual network (VNA). See Figure 10. The total number of virtual channels supported is the product of the virtual networks supported and the message classes supported. For the Intel QuickPath Interconnect Link layer this is a maximum of eighteen virtual channels (three SNP, three HOM, three NDR, three DRS, three NCS, and three NCB).
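The virtual-channel count above is simply the cross product of message classes and virtual networks. A small sketch, using the class abbreviations from Table 3 and the virtual networks described here, makes that explicit:

```python
# Virtual channels = message classes x virtual networks (illustrative enumeration).
from itertools import product

MESSAGE_CLASSES = ["SNP", "HOM", "NDR", "DRS", "NCS", "NCB"]  # six defined classes
VIRTUAL_NETWORKS = ["VN0", "VN1", "VNA"]  # two buffered networks plus adaptive VNA

virtual_channels = [f"{mc}/{vn}" for mc, vn in product(MESSAGE_CLASSES, VIRTUAL_NETWORKS)]
print(len(virtual_channels))  # 18
print(virtual_channels[:4])   # ['SNP/VN0', 'SNP/VN1', 'SNP/VNA', 'HOM/VN0']
```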


    Figure 10. Virtual Networks

Note: Only 4 of 6 message classes are shown.

    Credit/Debit Scheme

The Link layer uses a credit/debit scheme for flow control, as depicted in Figure 11. During initialization, a sender is given a set number of credits to send packets, or Flits, to a receiver. Whenever a packet or Flit is sent to the receiver, the sender decrements its credit counters by one credit, which can represent either a packet or a Flit depending on the type of virtual network being used. Whenever a buffer is freed at the receiver, a credit is returned to the sender for that buffer. When the sender's credits for a given channel have been exhausted, it stops sending on that channel. Each packet contains an embedded flow control stream. This flow control stream returns credits from a receiving Link layer entity to a sending Link layer entity. Credits are returned after the receiving Link layer has consumed the received information, freed the appropriate buffers, and is ready to receive more information into those buffers.
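A minimal sketch of the credit/debit scheme just described, with a per-channel credit counter on the sender side. The class and method names are illustrative and not taken from the specification:

```python
# Illustrative credit/debit flow control for one virtual channel.
class CreditChannel:
    def __init__(self, initial_credits: int):
        # Credits granted by the receiver during initialization.
        self.credits = initial_credits

    def can_send(self) -> bool:
        return self.credits > 0

    def send(self) -> None:
        """Consume one credit per Flit (or packet, depending on the virtual network)."""
        if not self.can_send():
            raise RuntimeError("no credits: sender must stall on this channel")
        self.credits -= 1

    def return_credit(self, count: int = 1) -> None:
        """Receiver freed buffers and returned credits in the embedded flow control stream."""
        self.credits += count

channel = CreditChannel(initial_credits=2)
channel.send(); channel.send()
print(channel.can_send())   # False -> sender stalls until a credit is returned
channel.return_credit()
print(channel.can_send())   # True
```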

    Figure 11. Credit/Debit Flow Control

    Reliable Transmission

CRC error checking and recovery procedures are provided by the Link layer to isolate the effects of routine bit errors (which occur on the physical interconnect) from having any higher-layer impact. The Link layer generates the CRC at the transmitter and checks it at the receiver. The Link layer requires the use of 8 bits of CRC within each 80-bit Flit. An optional rolling 16-bit CRC method is available for the most demanding RAS applications. The CRC protects the entire information flow across the link; all signals are covered. The receiving Link layer performs the CRC calculation on the incoming Flit and, if there is an error, initiates a link-level retry. If an error is detected, the Link layer automatically requests that the sender back up and retransmit the Flits that were not properly received. The retry process is initiated by a special-cycle Control Flit, which is sent back to the transmitter to start the backup and retry process. The Link layer initiates and controls this process; no software intervention is needed. The Link layer reports the retry event to the software stack for error tracking and predictive maintenance algorithms. Table 4 details some of the error detection and recovery capabilities provided by the Intel QuickPath Interconnect. These capabilities are a key element of meeting the RAS requirements for new platforms.
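To make the per-Flit protection concrete, here is a generic CRC-8 over the 72 non-CRC bits of a Flit. The polynomial (0x07) and the bit ordering are placeholders chosen for illustration only; the actual CRC polynomial and Flit layout are defined by the Intel QuickPath Interconnect specification, not by this paper:

```python
# Illustrative CRC-8 over a Flit's 72 payload bits (polynomial is NOT the spec's).
CRC8_POLY = 0x07  # placeholder generator polynomial

def crc8(data: bytes, poly: int = CRC8_POLY) -> int:
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

payload = bytes(9)                                     # 72 bits = 9 bytes of Flit content
sent_crc = crc8(payload)
corrupted = bytes([payload[0] ^ 0x01]) + payload[1:]   # single bit error on the wire
print(crc8(corrupted) == sent_crc)                     # False -> triggers a link-level retry
```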


Table 4. RAS: Error Detection and Recovery Features

Feature                        Intel QuickPath Interconnect
CRC checking                   Required
All signals covered            Yes
CRC type                       8 bits / 80 bits
CRC bits per 64-B packet       72
Cycle penalty (bit times) (1)  None
Impact of error                Recovered
Additional features            Rolling CRC

1. Lower is better. The cycle penalty increases latency and reduces bandwidth utilization.

In addition to error detection and recovery, the Link layer controls the use of some additional RAS features in the Physical layer, such as self-healing links and clock fail-over. Self-healing allows the interconnect to recover from multiple hard errors with no loss of data or software intervention. Unrecoverable soft errors will initiate a dynamic link width reduction cycle: the sender and receiver negotiate to connect through a reduced link width. The link automatically reduces its width to either half or quarter width, based on which quadrants (five lanes) of the link are good. As the data transmission is retried in the process, no data loss results. Software is notified of the events, but need not directly intervene or control the operation. In the case of the clock lane succumbing to errors, the clock is mapped to a pre-defined data lane and the link continues to operate in half-width mode. If the data lane that is serving as the fail-over is not functional, there is a second fail-over clock path. Both the self-healing and clock fail-over functionalities are direction independent, meaning one direction of the link could be in a RAS mode while the other direction continues operating at full capacity. In these RAS modes, there is no loss of interconnect functionality or error protection; the only impact is reduced bandwidth on the link that is operating in RAS mode.
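A simplified sketch of the width-reduction decision described above, assuming the 20-lane link is treated as four 5-lane quadrants and that any sufficiently large set of good quadrants can be used; the real selection rules are product specific and more constrained than this:

```python
# Simplified self-healing width selection (assumed rules, not the specification's).
def select_width(good_quadrants: set) -> str:
    """Pick the widest operating mode the surviving 5-lane quadrants can support."""
    if good_quadrants == {0, 1, 2, 3}:
        return "full (20 lanes)"
    if len(good_quadrants) >= 2:
        return "half (10 lanes)"
    if len(good_quadrants) == 1:
        return "quarter (5 lanes)"
    return "link down"

print(select_width({0, 1, 2, 3}))  # full (20 lanes)
print(select_width({0, 2, 3}))     # half (10 lanes)
print(select_width({1}))           # quarter (5 lanes)
```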

    Routing Layer

The Routing layer is used to determine the course that a packet will traverse across the available system interconnects. Routing tables are defined by firmware and describe the possible paths that a packet can follow. In small configurations, such as a two-socket platform, the routing options are limited and the routing tables quite simple. For larger systems, the routing table options are more complex, giving the flexibility of routing and rerouting traffic depending on how (1) devices are populated in the platform, (2) system resources are partitioned, and (3) RAS events result in mapping around a failing resource.
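As an illustration of what a firmware-defined routing table amounts to, here is a toy table for a hypothetical four-socket ring. The node names, topology, and port numbers are invented for the example and are not taken from any product:

```python
# Toy firmware-style routing table for a hypothetical 4-socket ring (P0-P3).
# Maps (current node, destination node) -> outgoing interconnect port.
from typing import Optional

ROUTING_TABLE = {
    ("P0", "P1"): "port0", ("P0", "P2"): "port0", ("P0", "P3"): "port1",
    ("P1", "P2"): "port0", ("P1", "P3"): "port0", ("P1", "P0"): "port1",
    ("P2", "P3"): "port0", ("P2", "P0"): "port0", ("P2", "P1"): "port1",
    ("P3", "P0"): "port0", ("P3", "P1"): "port0", ("P3", "P2"): "port1",
}

def next_port(node: str, destination: str) -> Optional[str]:
    """Return the port a packet should leave on, or None if it is already home."""
    if node == destination:
        return None
    return ROUTING_TABLE[(node, destination)]

print(next_port("P0", "P3"))  # port1
```

A RAS event or a repartitioning of the platform would simply be reflected as a new table handed down by firmware.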

    Transport Layer

The optional Transport layer provides end-to-end transmission reliability. This architecturally defined layer is not part of Intel's initial product implementations and is being considered for future products.


    Protocol Layer

In this layer, the packet is defined as the unit of transfer. The packet contents definition is standardized, with some flexibility allowed to meet differing market segment requirements. The packets are categorized into six different classes, as mentioned earlier in this paper: home, snoop, data response, non-data response, non-coherent standard, and non-coherent bypass. The requests and responses either affect the coherent system memory space or are used for non-coherent transactions (such as configuration, memory-mapped I/O, interrupts, and messages between agents).

The system's cache coherency across all distributed caches and integrated memory controllers is maintained by all the distributed agents that participate in the coherent memory space transactions, subject to the rules defined by this layer. The Intel QuickPath Interconnect coherency protocol allows both home snoop and source snoop behaviors. Home snoop behavior is optimized for greater scalability, whereas source snoop is optimized for lower latency. The latter is used primarily in smaller-scale systems, where the smaller number of agents creates a relatively low amount of snoop traffic. Larger systems with more snoop agents could develop a significant amount of snoop traffic and hence would benefit from a home snoop mode of operation. As part of the coherence scheme, the Intel QuickPath Interconnect implements the popular MESI protocol (see note 2) and, optionally, introduces a new F-state.

    MESIF

The Intel QuickPath Interconnect implements a modified format of the MESI coherence protocol. The standard MESI protocol maintains every cache line in one of four states: modified, exclusive, shared, or invalid. A new read-only forward (F) state has also been introduced to enable cache-to-cache clean line forwarding. Characteristics of these states are summarized in Table 5. Only one agent can have a line in the F-state at any given time; the other agents can have S-state copies. Even when a cache line has been forwarded in this state, the home agent still needs to respond with a completion to allow retirement of the resources tracking the transaction. However, cache-to-cache transfers offer a low-latency path for returning data other than that from the home agent's memory.

Table 5. Cache States

State          Clean/Dirty   May Write?   May Forward?   May Transition To?
M  Modified    Dirty         Yes          Yes            -
E  Exclusive   Clean         Yes          Yes            M, S, I, F
S  Shared      Clean         No           No             I
I  Invalid     -             No           No             -
F  Forward     Clean         No           Yes            S, I

2. The MESI protocol (pronounced "messy") is named after the four states of its cache lines: Modified, Exclusive, Shared, and Invalid. (In Search of Clusters, by Gregory Pfister.)
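The state properties in Table 5 can be captured in a few lines. This sketch merely transcribes the table into a data structure; it is not a coherence-protocol implementation:

```python
# MESIF state properties, transcribed from Table 5 (illustrative only).
from dataclasses import dataclass

@dataclass(frozen=True)
class CacheState:
    name: str
    dirty: bool
    may_write: bool
    may_forward: bool

MESIF = {
    "M": CacheState("Modified",  dirty=True,  may_write=True,  may_forward=True),
    "E": CacheState("Exclusive", dirty=False, may_write=True,  may_forward=True),
    "S": CacheState("Shared",    dirty=False, may_write=False, may_forward=False),
    "I": CacheState("Invalid",   dirty=False, may_write=False, may_forward=False),
    "F": CacheState("Forward",   dirty=False, may_write=False, may_forward=True),
}

# Only one agent may hold a given line in F at a time; other sharers hold it in S.
print([code for code, props in MESIF.items() if props.may_forward])  # ['M', 'E', 'F']
```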


    Protocol Agents

The Intel QuickPath Interconnect coherency protocol consists of two distinct types of agents: caching agents and home agents. A microprocessor will typically have both types of agents and possibly multiple agents of each type.

A caching agent represents an entity which may initiate transactions into coherent memory, and which may retain copies in its own cache structure. The caching agent is defined by the messages it may sink and source according to the behaviors defined in the cache coherence protocol. A caching agent can also provide copies of the coherent memory contents to other caching agents.

A home agent represents an entity which services coherent transactions, including handshaking as necessary with caching agents. A home agent supervises a portion of the coherent memory. Home agent logic is not specifically the memory controller circuits for main memory, but rather the additional Intel QuickPath Interconnect logic which maintains the coherency for a given address space. It is responsible for managing the conflicts that might arise among the different caching agents. It provides the appropriate data and ownership responses as required by a given transaction's flow.

There are two basic types of snoop behaviors supported by the Intel QuickPath Interconnect specification. Which snooping style is implemented is a processor-architecture-specific optimization decision. To oversimplify, source snoop offers the lowest latency for small multiprocessor configurations, while home snooping offers the best performance in systems with a high number of agents. The next two sections illustrate these snooping behaviors.


    Home Snoop

The home snoop coherency behavior defines the home agent as responsible for the snooping of other caching agents. The basic flow for a message involves up to four steps (see Figure 12). This flow is sometimes referred to as a three-hop snoop because the data is delivered in step 3. To illustrate, using a simplified read request to an address managed by a remote home agent, the steps are:

1. The caching agent issues a request to the home agent that manages the memory in question.

2. The home agent uses its directory structure to target a snoop to the caching agent that may have a copy of the memory in question.

3. The caching agent responds back to the home agent with the status of the address. In this example, processor #3 has a copy of the line in the proper state, so the data is delivered directly to the requesting cache agent.

4. The home agent resolves any conflicts and, if necessary, returns the data to the original requesting cache agent (after first checking to see if data was delivered by another caching agent, which in this case it was), and completes the transaction.

The Intel QuickPath Interconnect home snoop behavior implementation typically includes a directory structure to target the snoop to the specific caching agents that may have a copy of the data. This has the effect of reducing the number of snoops and snoop responses that the home agent has to deal with on the interconnect fabric. This is very useful in systems that have a large number of agents, although it comes at the expense of latency and complexity. Therefore, home snoop is targeted at systems optimized for a large number of agents.
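A compact sketch of the three-hop flow just described, assuming a single requested line, a directory that already records P3 as a holder, and message passing reduced to log entries; all names are illustrative:

```python
# Illustrative three-hop home snoop flow (P1 requester, P4 home agent).
def home_snoop_read(requester: str, home: str, directory: dict) -> list:
    log = [f"1. {requester} -> {home}: read request"]
    for peer in directory.get("line", []):            # home agent consults its directory
        log.append(f"2. {home} -> {peer}: snoop")
        log.append(f"3. {peer} -> {home}: snoop response; {peer} -> {requester}: data")
    log.append(f"4. {home} -> {requester}: completion (conflicts resolved)")
    return log

for step in home_snoop_read("P1", "P4", {"line": ["P3"]}):
    print(step)
```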

Figure 12. Home Snoop Example

Labels: P1 is the requesting caching agent; P2 and P3 are peer caching agents; P4 is the home agent for the line.
Precondition: P3 has a copy of the line in either M, E, or F-state.

Step 1. P1 requests data which is managed by the P4 home agent. (Note that P3 has a copy of the line.)
Step 2. P4 (home agent) checks the directory and sends snoop requests only to P3.
Step 3. P3 responds to the snoop by indicating to P4 that it has sent the data to P1. P3 provides the data back to P1.
Step 4. P4 provides the completion of the transaction.


    Source Snoop

The source snoop coherency behavior streamlines the completion of a transaction by allowing the source of the request to issue both the request and any required snoop messages. The basic flow for a message involves only three steps, sometimes referred to as a two-hop snoop since data can be delivered in step 2. Refer to Figure 13. Using the same read request discussed in the previous section, the steps are:

1. The caching agent issues a request to the home agent that manages the memory in question and issues snoops to all the other caching agents to see if they have copies of the memory in question.

2. The caching agents respond to the home agent with the status of the address. In this example, processor #3 has a copy of the line in the proper state, so the data is delivered directly to the requesting cache agent.

3. The home agent resolves any conflicts and completes the transaction.

The source snoop behavior saves a hop, thereby offering a lower latency. This comes at the expense of requiring agents to maintain a low-latency path to receive and respond to snoop requests; it also imparts additional bandwidth stress on the interconnect fabric, relative to the home snoop method. Therefore, the source snoop behavior is most effective in platforms with only a few agents.
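The same read under the two-hop source snoop behavior, again as an illustrative sketch rather than protocol code; note that the requester fans out the snoops itself and the data arrives in step 2:

```python
# Illustrative two-hop source snoop flow (P1 requester, P2/P3 peers, P4 home agent).
def source_snoop_read(requester: str, peers: list, home: str, holder: str) -> list:
    log = [f"1. {requester} -> {home}: read request; {requester} -> {', '.join(peers)}: snoops"]
    for peer in peers:
        if peer == holder:
            log.append(f"2. {peer} -> {home}: snoop response; {peer} -> {requester}: data")
        else:
            log.append(f"2. {peer} -> {home}: snoop response (no copy)")
    log.append(f"3. {home} -> {requester}: completion (conflicts resolved)")
    return log

for step in source_snoop_read("P1", ["P2", "P3"], "P4", holder="P3"):
    print(step)
```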

Figure 13. Source Snoop Example

Labels: P1 is the requesting caching agent; P2 and P3 are peer caching agents; P4 is the home agent for the line.
Precondition: P3 has a copy of the line in either M, E, or F-state.

Step 1. P1 requests data which is managed by the P4 home agent. (Note that P3 has a copy of the line.)
Step 2. P2 and P3 respond to the snoop request to P4 (home agent). P3 provides the data back to P1.
Step 3. P4 provides the completion of the transaction.


    Performance

Much is made about the relationship between the processor interconnect and overall processor performance. There is not always a direct correlation between interconnect bandwidth, latency, and processor performance. Instead, the interconnect must perform at a level that will not limit processor performance. Microprocessor performance has been continually improving through architectural enhancements, multiple cores, and increases in processor frequency. It is important that the performance of the interconnect remain sufficient to support the latent capabilities of a processor's architecture. However, providing more interconnect bandwidth or better latency does not by itself equate to an increase in performance. There are a few benchmarks which create a synthetic stress on the interconnect and can be used to see a more direct correlation between processor performance and interconnect performance, but they are not good proxies for real end-user performance. The Intel QuickPath Interconnect offers a high-bandwidth, low-latency interconnect solution. The raw bandwidth of the link is 2X the bandwidth of the previous Intel Xeon processor front-side bus. To the first order, it is easy to see the increase in bandwidth that the Intel QuickPath Interconnect offers.

    Raw Bandwidth

Raw and sustainable bandwidths are very different and have varying definitions. The raw bandwidth, or maximum theoretical bandwidth, is the rate at which data can be transferred across the connection without any accounting for the packet structure overhead or other effects. The simple calculation is the number of bytes that can be transferred per second. The Intel QuickPath Interconnect is a double-pumped data bus, meaning data is captured at the rate of one data transfer per edge of the forwarded clock, so every clock period captures two chunks of data. The maximum amount of data sent across a full-width Intel QuickPath Interconnect link is 16 bits, or 2 bytes. Note that 16 bits are used for this calculation, not 20 bits: although the link has up to 20 1-bit lanes, no more than 16 bits of real data payload are ever transmitted at a time, so the more accurate calculation uses 16 bits of data at a time. The maximum frequency of the initial Intel QuickPath Interconnect implementation is 3.2 GHz. This yields a double-pumped data rate of 6.4 GT/s with 2 bytes per transition, or 12.8 GB/s. An Intel QuickPath Interconnect link pair operates two uni-directional links simultaneously, which gives a final theoretical raw bandwidth of 25.6 GB/s. Using similar calculations, Table 6 shows the processor's bus bandwidth per port.

Table 6. Processor Bus Bandwidth Comparison, Per Port (1)

Bandwidth (higher is better)   Intel Front-Side Bus   Intel QuickPath Interconnect
Year                           2007                   2008
Rate (GT/s)                    1.6                    6.4
Width (bytes)                  8                      2
Bandwidth (GB/s)               12.8                   25.6
Coherency                      Yes                    Yes

1. Source: Intel internal presentation, December 2007.

    Packet Overhead

The maximum theoretical bandwidth is not sustainable in a real system, and sustainable bandwidth is implementation specific. A step closer to sustainable bandwidth is accounting for the packet overhead. Although still not a true measurable result, it is closer to real-life bandwidth. The typical Intel QuickPath Interconnect packet has a header Flit which requires four Phits to send across the link. The typical data transaction in microprocessors is a 64-byte cache line. The data payload, therefore, requires additional Flits beyond the header; a complete 64-byte data packet occupies nine Flits in total (consistent with the 72 CRC bits per 64-byte packet listed in Table 4), so the header and per-Flit CRC overhead reduce the sustainable bandwidth below the raw figures in Table 6.
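A sketch of that overhead accounting, assuming (as inferred above from Table 4) that a 64-byte transfer is carried as one header Flit plus eight data Flits; the exact packet formats are specification details not given in this paper:

```python
# Effective bandwidth after packet overhead (assumed 1 header Flit + 8 data Flits).
PHITS_PER_FLIT_FULL_WIDTH = 4
header_flits, data_flits = 1, 8        # assumed layout for a 64-byte cache line packet

total_flits = header_flits + data_flits
total_phits = total_flits * PHITS_PER_FLIT_FULL_WIDTH   # Phits on the wire per packet
link_gb_per_s = 12.8                                    # one direction at 6.4 GT/s
flit_efficiency = data_flits / total_flits              # fraction of Flits carrying line data
print(total_phits, round(flit_efficiency, 3), round(link_gb_per_s * flit_efficiency, 2))
# 36 0.889 11.38
```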


Reliability, Availability, and Serviceability

A key difference between desktop computers and servers is reliability, availability, and serviceability. Although all users want a reliable computer, the impact of a problem is typically much greater on a server running a core business application than on a personal workstation. Therefore, processors used in servers typically have more RAS features than their client brethren. For example, more expensive ECC memory is standard in servers, whereas clients use the less expensive non-ECC DRAM. Likewise, RAS features are important on a server processor interconnect, especially for the larger expandable and mission-critical servers found in IT data centers. The Intel QuickPath Interconnect was designed with these types of applications in mind, and its architecture includes several important RAS features. Product developers will optimize the overall system design by including the set of RAS features needed to meet the requirements of a particular platform. As such, the inclusion of some of these features is product specific.

One feature common across all Intel QuickPath Interconnect implementations is the use of CRC. The interconnect transmits 8 bits of CRC with every Flit, providing error detection without a latency performance penalty. Other protocols use additional cycles to send the CRC, thereby impacting performance. Many of these RAS features were identified previously in the Link Layer section of this document. They include retry mode and per-Flit CRC, as well as product-specific features such as rolling CRC, hot detect, link self-healing, and clock fail-over.

    Processor Bus Applications

In recent years, Intel has enabled innovative silicon designs that are closely coupled to the processor. For years, the processor bus (front-side bus) has been licensed to companies to develop chipsets that allow servers with eight, sixteen, thirty-two, or more processors. More recently, several companies have developed products that allow a re-programmable FPGA module to plug into an existing Intel Xeon processor socket. These innovative accelerator applications allow for significant speed-up of niche algorithms in areas like financial analytics, image processing, and oil and gas recovery. These innovative applications will continue on Intel's new processor platforms that use the Intel QuickPath Interconnect instead of the front-side bus. Unlike high-performance I/O and accelerator designs found on PCI Express*, these innovative products require the benefits of close coupling to the processor, such as ultra-low latency and cache coherency. The bulk of high-performance I/O and accelerator designs are found on the PCI Express* bus, which has an open license available through the PCI-SIG. This bus, however, has the limitation of being non-coherent. For applications requiring coherency, the Intel QuickPath Interconnect technology could be licensed from Intel. As new innovative accelerator applications are developed, they can take advantage of PCI Express* for the broadest set of applications and the Intel QuickPath Interconnect for the remaining cache-coherent niche.


    Summary

With its high-bandwidth, low-latency characteristics, the Intel QuickPath Interconnect advances the processor bus evolution, unlocking the potential of next-generation microprocessors. The 25.6 GB/s of low-latency bandwidth provides the basic fabric required for distributed shared memory architectures. The inline CRC codes provide more error coverage with less overhead than serial CRC approaches. Features such as lane reversal, polarity reversal, clock fail-over, and self-healing links ease the design of highly reliable and available products. The two-hop, source snoop behavior with cache line forwarding offers the shortest request completion in mainstream systems, while the home snoop behavior allows for optimizing highly scalable servers. With all these features and performance, it is no surprise that various vendors are designing innovative products around this interconnect technology and that it provides the foundation for the next generation of Intel microprocessors and many more to come.