
A Fully Integrated Multi-CPU, Processor Graphics, and Memory Controller 32-nm Processor

Marcelo Yuffe, Moty Mehalel, Ernest Knoll, Joseph Shor, Senior Member, IEEE, Tsvika Kurts, Eran Altshuler, Eyal Fayneh, Kosta Luria, and Michael Zelikson

Abstract—This paper describes the second-generation Intel Core processor, a 32-nm monolithic die integrating four IA cores, a processor graphics, and a memory controller. Special attention is given to the circuit design challenges associated with this kind of integration. The paper describes the chip floor plan, the power delivery network, energy conservation techniques, the clock generation and distribution, the on-die thermal sensors, and a novel debug port.

Index Terms—Clocking, Intel second-generation core, low Vccmin, modularity, power gates, thermal sensors.

I. INTRODUCTION

THE desktop and mobile computer marketplace is constantly looking for system performance improvements, lower power dissipation density, and better form factors for miniaturization; these three vectors seem to contradict each other.

The 32-nm Second Generation Intel Core (SGIC) processor tackles this paradigm by integrating up to four high-performance Intel Architecture (IA) cores, a power/performance-optimized processor graphics (PG), and memory and PCIe controllers in the same die. The chip is manufactured using Intel’s 32-nm process, which incorporates the second generation of Intel’s high-k metal gates for improved leakage current control; the process also provides nine copper interconnect metal layers that were well exploited for top-level interconnect as well as for robust power delivery.

The SGIC architecture block diagram is shown in Fig. 1, and the floor plan of the four IA-core version is shown in Fig. 2. The SGIC IA core implements an improved branch prediction algorithm, a micro-operation (Uop) cache, a floating-point advanced vector extension (AVX), a second load port in the L1 cache, and bigger register files in the out-of-order part of the machine; all of these architecture improvements boost the IA core performance without increasing the thermal power dissipation envelope or the average power consumption (to preserve battery life in mobile systems).

Although these architectural advances are beyond the scope of this paper, the Intel AVX, which extended the SSE 128-bit vectors to 256 b, is worth mentioning (more information about the SGIC architectural features can be found in [1]).

Manuscript received April 28, 2011; revised June 29, 2011; accepted July 29, 2011. Date of publication October 13, 2011; date of current version December 23, 2011. This paper was approved by Guest Editor Alice Wang. The authors are with Intel Corporation, Haifa 31015, Israel (e-mail: marcelo.[email protected]). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSSC.2011.2167814

Fig. 1. SGIC block diagram.

Fig. 2. SGIC floorplan, power planes, and choppability axes.

The AVX architecture supports three-operand syntax, which allows more efficient coding by the compiler. Additional instructions were added to simplify auto-vectorization of high-level languages to assembly by the compiler. The SGIC architecture added instructions which support single- and double-precision floating-point data types. The additional state needed for the growth of the 16 registers to 256 b is supported by new XSAVE/XRSTOR instructions, which were designed to support additional future extensions of the Intel 64 architecture.

The CPUs and PG share the same 8-MB level-3 cache (L3$) memory. The data flow is optimized by a high-performance on-die interconnect fabric (called the “ring”) that connects the CPUs, the PG, the L3 cache, and the system agent (SA) unit. The SA houses a 1600-MT/s dual-channel DDR3 memory controller, a 20-lane PCIe Gen2 controller, a two-parallel-pipe display engine, the power management control unit, and the testability logic. An on-die PROM is used for configurability and yield optimization.

II. MODULAR FLOOR PLAN

From the beginning of the project, the SGIC was conceived as a modular design that would allow the integration of different blocks into a single chip. The SGIC team opted to divide the chip into several modules: IA core, SA, PG, L3 cache, and I/O; the modules were designed independently and assembled together by a dedicated full-chip team that took care of the integration of the different modules and the full-chip validation aspects.

The PG module is of special interest because it was designed using completely different design methodologies and CAD tools; this module even used a completely different standard cell library and a separate power delivery network. The key to the smooth integration of this block into the rest of the chip was the ring bus, which provides a common protocol for all of the modules of the chip, allowing resource sharing between the different modules (for example, the L3$ space can be accessed by any of the modules). The ring protocol and the ring distributed controller take care of the ring traffic to minimize the performance impact of data traffic congestion. The design team also took advantage of the common interconnect protocol and physical layer provided by the ring to bridge between the different design methodologies used for the different modules; this was especially important for the integration of the PG.

The modular ring interconnect enables the four-core die to be easily converted into a two-core die by “chopping” out two cores and two L3 cache modules, as described in Fig. 2. Additional optimizations can be done by reducing the number of execution units of the PG or by reducing the L3 cache size. This modular floor-plan technique converts the tedious and time-consuming task of creating die variations into a simple database management exercise, considerably reducing the time it takes to bring the different flavors of the product to market. The SGIC was implemented in three different flavors: the die size of the i7-2820QM model (four IA cores, 8-MB L3$, 12-EU PG) is 216 mm², the die size of the i7-2620M model (two IA cores, 4-MB L3$, 12-EU PG) is 149 mm², and the die size of the i3-2100 model (two IA cores, 3-MB L3$, 6-EU PG) is 130 mm².

III. POWER DELIVERY NETWORK AND EMBEDDED POWER GATES

Although the SGIC implements Intel SpeedStep technology for minimizing the power consumed by the CPU, the product requirements clearly indicated that efficient power gating is a must in order to meet the aggressive average-power goals. Definition of the power delivery network (PDN) topology and, in particular, the implementation of power gates in the SGIC were based on several criteria: 1) co-optimize the quality of power delivery on both die and package levels; 2) enable flexible power management, i.e., support fine granularity for gated and ungated regions and support different power states; 3) minimize the energy penalty associated with power-gate switching; and 4) minimize the amount of switching noise injected by the power-gate switching.

Fig. 3. SGIC core IREM image when (a) the processor is idle and (b) during deep sleep.

Fig. 4. Residual gated voltage in C6 state with (“strong”) and without (“weak”) negative bias.

The configuration chosen based on the above criteria comprises p-type embedded power gates (EPGs), i.e., power switches that reside inside the gated region, forming a grid of power transistors connected among themselves by a gated power grid. The total width of the EPG is approximately 2 m, a fraction of the accumulated width of the SNB IA-core transistors. Due to the dense and very regular layout, the area consumed by the power gates is 3.8% of the core area. Since a nongated version of the SGIC core is not available, it is practically impossible to quantify the effect of the power gates on timing; however, the product has met all timing/performance goals. An ungated power grid, which spans the whole core, shares the two top metal-layer resources with the gated PDN. Fig. 3 shows an infrared emission microscopy (IREM) photo of the SGIC core for two power states: idle—C1 [Fig. 3(a)] and deep sleep—C6 [Fig. 3(b)]. As can be seen from Fig. 3(b), in the C6 state most of the supply voltage falls across the power gates, resulting in the visible EPG grid (thin vertical lines spread over the core). Such a PDN topology enables allocation of selected fubs or individual circuits to either the gated or the ungated supply. Bright spots in Fig. 3(b) represent circuitry that is fed by the ungated power supply, immersed in a region of gated logic: the control hub that supports snoops during the C6 state, and the PLL. Small light “dots” in Fig. 3(b) represent individual ungated circuitry such as thermal probes and ungated repeaters. Such local coexistence of gated and ungated logic is used more extensively in the System Agent, where all logic blocks associated with PCIe are gated individually.

The SGIC PDN topology enables minimization of the EPG switching losses: 1) all on-package decoupling capacitors are connected to the ungated power supply, which does not switch, and 2) since the power gates are “immersed” in the gated power grid, it is possible to discharge most of the energy accumulated at the EPG switching nodes into the gated power supply.

In order to increase the efficiency of power gating, the gate–source voltage of the pMOS switches was negatively biased (i.e., the gate is driven by a voltage higher than the source voltage), driving the transistors into a deeper subthreshold regime. The biasing circuit is situated in the ring area, near the core, and the bias voltage is distributed across the whole core using a dedicated low-resistance grid. Fig. 4 presents the results of a measurement of the residual gated voltage during the C6 state. The principal schematic of the biasing circuit is shown in Fig. 5: VCCA is an on-die high-voltage power supply, and VCCB is the voltage driven into the gates of the power-gating transistors. The switching speed of this circuit is tuned to control the power-gate switching strength in order to minimize the switching noise injected by this operation.

Fig. 5. EPG gate biasing circuit.

Fig. 6. Voltage dependence of the supply path resistance. Simulation temperature was adjusted to factor in self-heating.

The core PDN performance was analyzed with a commercially available grid simulator for several different real stresses, including a power virus and the idle state. Simulation results for the idle state were compared to corresponding measured data, as shown in Fig. 6.

Die power dissipation in the C6 state was measured with the embedded power gates enabled and then disabled; the difference between the two measurements yields the corresponding power savings. The same was done in the System Agent in order to quantify the effectiveness of the PCIe gating. By measuring the part’s power dissipation when the power gates are enabled and when they are disabled, it is possible to measure the average power merit of the power gates. This was done on a representative set of SGIC units at two different temperatures (110 °C and 50 °C) and at two different power supply voltages (0.88 V and 1.10 V); power gating saves a significant fraction of the total power dissipation, which translates to savings of more than 90% of the IA core power dissipation.
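To make the mechanism behind the negative gate bias concrete, the following Python sketch evaluates a first-order subthreshold-leakage model; the slope factor, threshold voltage, and bias values are illustrative assumptions, not SGIC device data.

```python
import math

def relative_leakage(v_gs: float, v_t: float, n: float = 1.4,
                     temp_c: float = 50.0) -> float:
    """First-order subthreshold model: I ~ exp((Vgs - Vt) / (n * kT/q)).
    Voltages are overdrive magnitudes for the pMOS switch; the returned
    value is a unitless ratio."""
    phi_t = 8.617e-5 * (temp_c + 273.15)  # thermal voltage kT/q, in volts
    return math.exp((v_gs - v_t) / (n * phi_t))

# Gate tied to the source (Vgs = 0) versus a 150-mV negative gate-source
# bias (gate driven above the source), assuming a 300-mV threshold.
off_nominal = relative_leakage(v_gs=0.0, v_t=0.3)
off_biased = relative_leakage(v_gs=-0.15, v_t=0.3)
print(f"residual leakage reduced ~{off_nominal / off_biased:.0f}x by the bias")
```

Because the off-state current is exponential in the gate overdrive, even a modest negative bias buys a large leakage reduction, which is why a dedicated low-resistance bias grid is worth its routing cost.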


Fig. 7. RF shared strength control pMOS devices.

IV. VCCMIN MINIMIZATION

As shown in Fig. 2, in the SGIC the cores, the ring, and the L3$ share the same power plane. The key challenge in this scheme is bringing all of these components to comparable Vccmin levels. This approach guarantees that the overall power consumption at the full-chip level will be minimal at low power states, when the chip is running at the lowest possible voltage needed to support a specific operating frequency. In addition, this approach eliminates redundant design overkill in any one of the components. For example, if the L3$ were limited to a much higher Vccmin than the core register files (RFs), then the RFs could have been designed to a higher Vccmin with no impact on the overall full-chip Vccmin, but with area benefits from smaller cells and power benefits from low-leakage transistors.

There are three components that limit Vccmin: logic paths, RFs, and small-signal arrays (SSAs). The L3$ is the largest SSA, as it includes the largest number of devices, most of them at minimal sizing. As a result, the random statistical variations in the L3$ are very significant, so the L3$ was the biggest design and process challenge to make it run at the same low power-supply voltages as the core logic. Previous processors solved this problem by connecting the L3$ to a separate, higher voltage power plane; however, this approach considerably increases the power dissipated by the L3 cache itself. Taking into account that the SGIC implements 3, 4, or 8 MB of L3 cache capacity (depending on the chip configuration), the power dissipated by the cache memory accounts for a big portion of the overall power consumption of the die. Enabling the L3$ to run at low Vccmin has saved 1.2-W average power for the SGIC quad-core die.

The other arrays, like the RFs and smaller SSAs, required design focus and attention as well. These arrays run at higher bandwidth than the L3$, so the timing constraints are tighter. Several circuit and logic design techniques have been developed to minimize the Vccmin of the SSAs and the RFs of the chip, to bring them to a lower level than the core logic.

Fig. 8. Vccmin improvement after applying SGIC Vccmin reduction circuits.

Fig. 7 illustrates one of these techniques in the RFs. Random fabrication variations may cause RF write-ability degradation at low voltages; this technique weakens the memory-cell pull-up device’s effective strength, solving the low-voltage write-ability issue caused by a too-strong pMOS device in the memory cells. The effective size of the shared pMOS is set during production testing by enabling any combination of the three parallel transistors. Similar techniques have been developed for the L3$ and other SSAs. Fig. 8 shows the Vccmin distribution of the baseline and its improvement in the SGIC.
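The production-time strength programming can be pictured as selecting one of the eight enable combinations of the three parallel devices. The sketch below uses hypothetical leg widths, since the actual device sizing is not given in the paper.

```python
from itertools import product

# Hypothetical relative widths for the three parallel shared-pMOS legs;
# the real device sizes are not published.
LEG_WIDTHS = (1.0, 2.0, 4.0)

# A fuse setting enables any subset of the legs; the effective pull-up
# strength is the sum of the enabled widths. Production test selects the
# optimal combination (the "WRpgm2" setting of Fig. 9).
for fuses in product((0, 1), repeat=3):
    width = sum(w for w, on in zip(LEG_WIDTHS, fuses) if on)
    print(f"fuses={fuses} -> effective pull-up width {width:.1f}")
```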

A key component in the success in meeting the aggressive Vccmin target is accurate modeling at the presilicon stage. Large arrays, as opposed to logic paths, cannot be fixed at postsilicon stages because of their large area and high density. An accurate statistical simulation algorithm has been developed to cover all the failure modes of the arrays—write-ability, read stability, and data retention (soft errors may become an important Vccmin limiter if not taken into account properly, but this limitation can easily be solved by protecting the few problematic state elements with parity, by using an error correction code, or by simply increasing the area of those state elements). The input data to the model include the variations of the main transistor parameters, such as threshold voltage, effective channel length, mobility, and velocity saturation. All of the failure modes are simulated in transient ac mode to accurately model the real activation conditions of the cells. This approach differs from the traditional algorithm of modeling read stability by static-noise-margin analysis [2]. Fig. 9 shows the correlation between silicon results and the simulation model. The x-axis includes the various operation modes of the L3. WRpgm* refers to write operation at different programming levels of the shared pMOS device (see Fig. 7); in production, the optimal setting is used (WRpgm2 in Fig. 9). AC_Ret stands for retention failure during a write operation; this failure mechanism is related to the voltage droop that the shared pMOS creates on the memory-array power supply. SLEEP_Ret is the retention Vccmin when the L3 sleep transistor is enabled for power reduction.

Fig. 9. Silicon versus simulation Vccmin results at various modes and programming levels.

Fig. 10. SGIC final Vccmin results of the L3 SSA, RFs, and random logic.

The final outcome of this work is shown in Fig. 10. The obtained Vccmin of the three components of the core/ring is equalized, and there is no clear limiter. The L3 result is obtained with the optimal programming setting and with the redundancy recovery mechanism applied, while the chip is running under normal operating conditions with all the power-saving mechanisms enabled. The logic part has no special circuit techniques, and its result reflects the speed limitation of the critical paths at low-voltage operation, after postsilicon speedup at the minimum operating voltage/minimum operating frequency point was done.
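The flavor of such presilicon statistical analysis can be conveyed with a strongly simplified model (not the authors' algorithm): assume each cell fails to write below a voltage set by its random threshold-voltage mismatch, require the whole array to pass with a target yield, and solve for the supply voltage. All numbers below are illustrative.

```python
from statistics import NormalDist

def array_vccmin(n_cells: int, sigma_vt: float = 0.030, v_nom: float = 0.50,
                 sensitivity: float = 2.0, array_yield: float = 0.999) -> float:
    """Toy model: a cell writes correctly above v_nom + sensitivity * |dVt|,
    with per-cell mismatch dVt ~ N(0, sigma_vt). The array works only if
    every cell works, so each cell must pass with probability
    array_yield ** (1 / n_cells); inverting the half-normal CDF gives the
    required supply voltage."""
    p_cell = array_yield ** (1.0 / n_cells)
    d_vt = sigma_vt * NormalDist().inv_cdf((1.0 + p_cell) / 2.0)
    return v_nom + sensitivity * d_vt

# Larger arrays sample further into the variation tails, so Vccmin rises;
# this is why the L3$ (the most cells, at minimum device sizes) was the
# hardest block to bring down to the core-logic Vccmin.
for n in (1_000, 1_000_000, 64_000_000):
    print(f"{n:>10} cells -> Vccmin ~ {array_vccmin(n):.2f} V")
```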

V. HIGH-BANDWIDTH LOW-LATENCY CACHE ACCESS THROUGH THE RING

The ring provides the common platform used to connect the different modules (CPUs, shared L3$’s, PG, SA). To maximize performance, high-bandwidth, low-latency cache access is required. Cache-access messages are synchronously staged by high-phase transparent latches in the ring stops and by low-phase latches at clock-domain crossings (Fig. 11). Instantaneous clock skew (systematic skew and random jitter) degrades the timing accuracy and limits the ring frequency. Synchronization buffers could allow a higher frequency, but at the cost of added latency that affects overall performance. The clocking scheme (described in Section VI) provides the skew and jitter required by the ring performance. If the instantaneous skew is less than the skew budget, the data propagates through all latches in transparency.

Fig. 11. Ring path. Phase one (PH1) ring stop and phase two (PH2) single cross-domain example de-skew latches.

Fig. 12. Cross clock domain path.

The double de-skew latch (Fig. 12) provides robust cross-domain race margin without extra latency. The number of wires was locally doubled to halve the switching frequency, allowing larger skew between the clock domains; thanks to the locality, the global routing resource accounting was not impacted, and therefore the die size was not affected. During an even cycle, the “even” latch and the “even” side of the mux-latch are transparent; the same holds for the “odd” ones. The local traffic (latch to mux-latch) is at half frequency, thus improving the race margin without affecting the max delay (the mux-latch output is at full speed). A “Valid” signal is sent from a similar structure to the next ring stop a few gate delays before the data; it enables the clock rise that opens the latches. The Valid rising edge propagates in transparency through the clock path; a late falling edge may cause sampling of undetermined data, which is not further used since it is not valid. Sharing latching and arbitration in a mux-latch reduces the data propagation delay. The “Valid” signal participates in the ring-stop arbitration between the passing message and a new message (Request) through the mux-latch control during the transparency window, as seen in Fig. 12.

The propagation of the “Valid” signal through the local clock network during the transparency window allows it to be generated no earlier than the data, thus reducing latency. Static timing tools model the path through the latches and through the local clock drivers, from the data or clock-gate inputs to the latch output, as one transparency path consisting of several stages. A multicycle path is modeled from the latch opening, through several ring stops, to the receiver off the ring or to a latch capture. The clock-domain information is preserved for pruning and for max-skew-aware margin calculations. This accurate, non-worst-case timing model allows the low-latency ring implementation. The skew budget is only two thirds of a phase, allowing a shorter transparency window for improved race immunity; this is twice the real skew, to avoid nonreproducible cross-domain path failures due to PLL jitter.
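The skew-budget statement translates into simple arithmetic; the ring frequency below is an assumed example, not a specification.

```python
# Illustrative skew-budget arithmetic for the transparent-latch ring.
f_ring = 3.3e9                     # assumed ring clock frequency, Hz
phase = 0.5 / f_ring               # one clock phase = half the period, s

skew_budget = (2.0 / 3.0) * phase  # budget: two thirds of a phase
real_skew = skew_budget / 2.0      # budget is kept at 2x the expected skew
                                   # to absorb PLL jitter

print(f"phase          : {phase * 1e12:.1f} ps")
print(f"skew budget    : {skew_budget * 1e12:.1f} ps")
print(f"tolerated skew : {real_skew * 1e12:.1f} ps")
```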

VI. CLOCK GENERATION AND DISTRIBUTION

The clocking scheme shown in Fig. 13 employs 13 PLLs to generate the clocks for the different domains [3], [4]. The IA cores, the L3 cache, and the ring (which share the same power plane) run at the same frequency ([5] provides an excellent treatment of the architecture tradeoffs involved in core frequency selection). In order to minimize the skew and power in the clock distribution, each slice (CPU, L3 cache, and ring stop) uses its own PLL; the RCLK PLL assures low clock skew among the reference clocks of the entire die’s PLLs despite the different operating voltages of the different clock networks. A low-jitter PLL (LNPLL) design enables the ring data flow with minimal latency.

Fig. 13. SGIC clocking scheme.

Fig. 14. (a) LNPLL VCO block diagram. (b) VCO tuning network.

The PLL random jitter is mainly determined by the VCO quality. The LNPLL VCO [Fig. 14(a)] is a three-CMOS-stage, full-swing ring oscillator. The VCO stage is loaded by two tuning networks [based on varactors, see Fig. 14(b)] that change their loading under the control of a dc input voltage. The bottom load is controlled by the PLL control voltage, thus adjusting the PLL frequency and phase. The upper load is controlled by a PTAT circuit. The temperature-inverse-proportional loading of the VCO stages stabilizes the VCO frequency against temperature changes; this assures that the PLL remains locked at all operating temperatures despite the relatively low gain of the VCO. Metal capacitors are used in series with the varactors to separate the dc control voltage from the ac signal at the VCO stage output. Resistors are used to isolate the output of the VCO stage from the low-impedance dc source—either the PLL loop filter or the PTAT circuit.

Due to the limited capacitance ratio of the varactors, the required frequency range is covered by five overlapping frequency bands. The bands are implemented by five parallel switched varactor blocks. Within a band, the frequency ratio is better than 1.35, while the overall VCO frequency ratio is better than 2.5. A banding finite-state machine (FSM) is used to select the appropriate VCO band for the required clock frequency. The banding FSM can operate in two modes: automatic-band-select (ABS) mode and open-loop-frequency-mapping (OLFM) mode. In ABS mode, the banding is determined by the FSM during the closed-loop locking process. In OLFM mode, the VCO band limits are measured open-loop, and the required PLL output frequencies are then mapped to a corresponding VCO band by a lookup table. The OLFM mode is used for clock generators that must relock quickly after a frequency change, to support Intel SpeedStep technology.
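The OLFM flow can be sketched as a small lookup-table construction: the band limits are "measured" values (hypothetical numbers here), and each required output frequency is mapped to the covering band with the most margin.

```python
# Sketch of open-loop-frequency-mapping (OLFM) band selection.
# Band limits below are hypothetical measured values (GHz); the real VCO
# has five overlapping bands with an in-band ratio better than 1.35.
MEASURED_BANDS = [  # (band index, f_min, f_max), measured at open loop
    (0, 1.2, 1.7),
    (1, 1.6, 2.2),
    (2, 2.1, 2.9),
    (3, 2.7, 3.7),
    (4, 3.5, 4.8),
]

def build_lookup(targets):
    """Map each required PLL output frequency to the band whose measured
    range covers it with the most margin from both band edges."""
    table = {}
    for f in targets:
        candidates = [(min(f - lo, hi - f), idx)
                      for idx, lo, hi in MEASURED_BANDS if lo <= f <= hi]
        if not candidates:
            raise ValueError(f"{f} GHz not covered by any band")
        table[f] = max(candidates)[1]   # largest margin wins
    return table

# On a SpeedStep transition, the stored band is applied directly, so the
# PLL relocks quickly without re-running the closed-loop banding search.
print(build_lookup([1.6, 2.4, 3.3, 4.0]))
```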

The PTAT temperature compensation led to better than 4% open-loop VCO frequency change over a 120 °C temperature change. The CMOS ring-oscillator VCO provides good random noise performance but has a poor power supply rejection ratio (PSRR). In order to achieve the required performance, the PLL is supplied by an on-die low-noise linear voltage regulator, as reported in [6], with better than 40-dB PSRR and less than 50-µV random voltage noise across the entire frequency spectrum; this power supply source is completely separated from the CPU main power supply to avoid noise injection from the digital parts of the die (and the power gates) into the PLL power supply rail.

The measured rms long-term jitter is better than 2 ps for all bands and all frequency ranges; the period jitter is less than 0.2 ps. As an example, the integrated phase noise (1.5 MHz to 1 GHz) for a 3.3-GHz clock signal is presented in Fig. 15.

Fig. 15. Measured phase noise.

The clock distribution within one slice (a CPU and the adjacent L3 and ring stop) is shown in Fig. 16. The slice PLL generates the clock that is distributed through vertical spines, two in the L3 cache and three within the IA core. The spines drive global clock islands, allowing fine-granularity clock gating for power savings.

Fig. 16. Slice clock distribution.

The skew within a slice is kept low using clock compensators controlled by dedicated state machines. The slice PLL closes the loop through the L0 spine; thus, the L0 spine is deskewed to the PLL reference and can act as the slice timing reference. The phases of the two adjacent spines are compared with the timing reference, and the compensators are controlled to practically eliminate the skew due to within-die variation. The C0 and L1 spines are deskewed to L0, then C1 to C0, and finally C2 is deskewed to C1.

The spine-to-spine maximum skew is 10 ps, while the overall slice max skew is 16 ps. The scope image in Fig. 17 was probed on a specific die between two adjacent spines with a skew of 1.4 ps. The overall slice clock distribution power is 600 mW at 1 V and 3.3 GHz.

Fig. 17. Measured spine–spine clock skew.
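The chained deskew sequence can be illustrated with invented phase errors and an assumed compensator step size.

```python
# Sketch of the spine deskew sequence: L0 is deskewed to the PLL reference,
# then each spine is aligned to an already-aligned neighbor. Phase errors
# below are invented picosecond values for illustration.
phase_error = {"L0": 0.0, "L1": 7.3, "C0": 5.1, "C1": 9.8, "C2": 12.6}

# (spine to adjust, spine used as its timing reference)
DESKEW_ORDER = [("C0", "L0"), ("L1", "L0"), ("C1", "C0"), ("C2", "C1")]

RESOLUTION = 1.0  # assumed compensator step, ps

for spine, ref in DESKEW_ORDER:
    delta = phase_error[spine] - phase_error[ref]
    steps = round(delta / RESOLUTION)      # compensator steps applied
    phase_error[spine] -= steps * RESOLUTION
    print(f"{spine} vs {ref}: corrected {steps} steps, "
          f"residual {phase_error[spine] - phase_error[ref]:+.1f} ps")
```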

VII. THERMAL

One of the important functions in a processor is temperature control. When the chip gets too hot, its frequency needs to be lowered in order to allow it to cool down; this process is called “throttling.” It is important to have an accurate thermal sensor to provide this temperature information, since the sensing accuracy directly influences performance in this case. In addition, the thermal sensor provides information for fan regulation in the temperature range 50 °C–100 °C. There is also a fail-safe “catastrophic” function which shuts down the chip in the event that the temperature spikes significantly above the throttle point.

The SGIC has two types of thermal sensors. The first is a diode-based thermal sensor, described in [7], that compares the diode voltage (which has a negative temperature coefficient) to a reference voltage to output the temperature. This sensor functions over a very large temperature range of operation (−25 °C to 150 °C), providing information for the throttling, catastrophic, and fan-regulation functions. This sensor has been used in many generations of Intel processors. However, the diode-based sensor is rather large (83 000 µm²), so there is only one such sensor per core. Our simulations and silicon studies have determined that, across different applications, different areas of the core can get hot. In order to measure these hot-spots, we have introduced a miniaturized CMOS-based thermal sensor [8]. This sensor has a substantially reduced area (5100 µm²) compared with the diode sensor (Fig. 18), but has a more limited accurate temperature range when single-point calibration is used (due to within-die variation, every sensor must be calibrated independently). It is shaped as a tall thin block, enabling it to be placed into the repeater channels, which are normally used for cross-chip signaling buffers. These channels are very heavily populated with higher level metals, but contain very few transistors and lower metals. Therefore, the placement of the CMOS sensors in these channels makes them essentially “free” (since the sensors are fed by an independent low-noise power supply source, the area needed for this dedicated power network should be taken into account while planning the global chip routing resources). The CMOS sensor allows the SGIC to accurately throttle based on localized hot-spot sensing, for DFT, real-time measurements, and burn-in. The CMOS sensors are used heavily in burn-in, when the temperature gradients on the chip become very high; the sensors ensure that the localized temperature will not exceed the intended burn-in temperature.

As described in [8], the CMOS sensor output is proportional to the transistor threshold voltage and the mobility; silicon measurements confirm this. For the reader’s convenience, the CMOS sensor is explained here again to allow better interpretation of the importance of the silicon results presented in this paper.
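The division of labor among fan regulation, throttling, and the catastrophic fail-safe can be summarized in a small policy sketch; all thresholds below are invented, not SGIC specifications.

```python
# Illustrative temperature-management policy; thresholds are assumed values.
FAN_MIN_C, FAN_MAX_C = 50.0, 100.0   # fan speed regulated over this range
THROTTLE_C = 100.0                   # lower frequency above this point
CATASTROPHIC_C = 125.0               # fail-safe shutdown

def thermal_action(temp_c: float) -> str:
    if temp_c >= CATASTROPHIC_C:
        return "catastrophic shutdown"
    if temp_c >= THROTTLE_C:
        return "throttle (reduce frequency until the die cools)"
    if temp_c >= FAN_MIN_C:
        # Linear fan ramp across the regulation band.
        duty = (temp_c - FAN_MIN_C) / (FAN_MAX_C - FAN_MIN_C)
        return f"fan at {duty:.0%}"
    return "fan off"

for t in (45, 72, 101, 130):
    print(f"{t:>3} degC -> {thermal_action(t)}")
```

Because the throttle decision feeds directly into frequency, any sensing error translates into lost performance or reduced thermal margin, which is why sensor accuracy matters here.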

Fig. 18. Area comparison of the diode-based thermal sensor [7] and the CMOS sensor [8].

A simplified circuit schematic of the CMOS sensor is shown in Fig. 19. The voltage reference circuit on the left is used to generate the bias currents and a bias voltage V_B, which is roughly equal to V_T. The amplifier A2 forces the drain voltage of M5 to be equal to V_B. M5 is in the linear mode of operation because it shares a gate bias with M4. Thus, the equation describing its current is

I_M5 = μ_n · C_ox · (W/L) · [(V_GS − V_T) · V_B − V_B²/2]    (1)

where μ_n is the electron mobility and V_T is the threshold voltage, both of which decrease with temperature over the range of interest. This current is mirrored by M2 and M3 and integrated over the capacitor C. The voltage on C is compared with V_B and is used to trigger a pulse generator, which discharges C and is also the frequency output of the circuit. The frequency obeys the following equation:

F = I_M5 / (C · V_B)    (2)

Fig. 19. Schematic diagram of the CMOS sensor.

Fig. 20. Correlation of CMOS sensor count to (a) nMOS Idsat and (b) nMOS V_T.

The CMOS sensor frequency is input to a counter, such that the output count is proportional to the frequency. This count was compared with electrical test parameters of devices in the scribe lines of the wafers. This is shown in Fig. 20(a) and (b) for several thousands of units from different lots during wafer sort. It was found that the count was well correlated to the nMOS Idsat (which is proportional to mobility), as in Fig. 20(a), and to the nMOS V_T, as in Fig. 20(b). The count was uncorrelated to other sort parameters (e.g., pMOS V_T, pMOS mobility, and resistance). The linear correlation of the count (i.e., frequency) of the CMOS sensor to the nMOS mobility and V_T proves the validity of (2).
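Plugging plausible (invented) temperature dependences for μ_n and V_T into (1) and (2) reproduces the qualitative behavior: a frequency, and hence a count, that tracks mobility and threshold voltage. All constants below are assumptions for illustration.

```python
# Sketch of the CMOS sensor transfer function per (1) and (2); the device
# constants are invented, and only the trends (mobility and VT falling
# with temperature) follow the paper.
def sensor_frequency(temp_c: float) -> float:
    t = temp_c + 273.15
    mu = 0.060 * (t / 300.0) ** -1.5       # electron mobility, falls with T
    vt = 0.35 - 0.8e-3 * (temp_c - 25.0)   # threshold voltage, falls with T
    cox_wl = 2e-3                          # assumed Cox * W/L lumped scale
    vgs, vb = 0.9, vt                      # gate bias; drain held at VB ~= VT
    cap = 1e-12                            # integration capacitor, F

    i_m5 = mu * cox_wl * ((vgs - vt) * vb - vb**2 / 2.0)   # eq. (1)
    return i_m5 / (cap * vb)                               # eq. (2)

# The counter output is proportional to frequency, so the count tracks
# mobility and VT, as the wafer-sort correlations of Fig. 20 show.
for t in (25, 50, 75, 100):
    print(f"{t:>3} degC -> {sensor_frequency(t) / 1e6:.1f} MHz (toy model)")
```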

VIII. GDXC: A NOVEL ON-DIE PROBING TECHNIQUE

Due to the high integration of the SGIC, external buses are not observable by external equipment. To overcome this issue, the SGIC incorporates a dedicated die-probing port called Generic Debug eXternal Connection (GDXC), which outputs internal information in a packet format to be used for debug and validation purposes. GDXC is an essential debug tool for the SGIC from power-on of the first silicon until after launch, when the part is in mass production. GDXC comprises a debug bus that allows monitoring the traffic between the IA cores, PG, caches, and System Agent. GDXC has a dedicated port through which the SGIC exposes selected parts of its internal ring buses and functions, as well as power management event information. GDXC is a nonintrusive methodology that answers many of the debug community’s concerns by providing observability of the ring’s high-speed and out-of-order protocols. Its output may be connected to a third-party logic analyzer or to custom sampling logic. The GDXC port is composed of 16 output lanes using a PCIe-based protocol. GDXC also includes a hardware method for triggering, called G-ODLAT, which can enable early triggering on a failure scenario.

GDXC’s location allows observation of the ring for protocol correctness. It provides observability of the four functional sub-rings that comprise the ring.


Fig. 21. GDXC queue architecture.

Fig. 22. GDXC top side connector.

In addition, GDXC provides observability of power management transactions and Serial VID commands, to monitor the interaction between the SGIC and the external voltage regulator. GDXC helps in understanding the scenario that leads to a failure: its packet format, with a unique time-stamp method, makes it possible to align the events observed by GDXC from different on-die modules to the same time scale. GDXC comprises a set of queues that hold packets until they are issued out to the logic analyzer (Fig. 21).

Since many queues lead to a narrow pipe of x16 PCIe lanes, GDXC is susceptible to overflow. Thus, at the entrance of the queues there is a “Qualifier” which is used to filter out packets that are not essential to the current debug. These qualifiers significantly reduce GDXC’s susceptibility to overflow.

The connection to the external logic analyzer is made by a port located on the top side of the package. This approach saves the need for package-pin allocation while improving in situ accessibility for debugging in a system (Fig. 22).
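The overflow-avoidance role of the qualifiers can be sketched as a bounded queue with an entrance filter; the packet fields and the filter criterion below are illustrative only.

```python
from collections import deque

class QualifiedQueue:
    """GDXC-style queue: an entrance "Qualifier" drops packets that do not
    match the current debug focus, before they can contribute to overflow
    of the narrow output pipe."""

    def __init__(self, depth: int, wanted_sources: set):
        self.q = deque()
        self.depth = depth
        self.wanted = wanted_sources
        self.dropped_by_qualifier = 0
        self.overflowed = 0

    def push(self, packet: dict) -> None:
        if packet["source"] not in self.wanted:   # the Qualifier
            self.dropped_by_qualifier += 1
            return
        if len(self.q) >= self.depth:             # output pipe backed up
            self.overflowed += 1
            return
        self.q.append(packet)

q = QualifiedQueue(depth=4, wanted_sources={"ring", "svid"})
for i, src in enumerate(["ring", "core", "svid", "pg", "ring", "ring", "ring"]):
    q.push({"source": src, "timestamp": i})   # unique time stamp per packet
print(len(q.q), q.dropped_by_qualifier, q.overflowed)
```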

Fig. 23. SGIC die photograph.

IX. CONCLUSION

The Second Generation Intel Core was introduced to the market in early 2011. The part is offered in a variety of configurations and packages for optimal performance, cost, power consumption, and form-factor adaptation to the target system requirements.

The thermal design power (TDP) of the SGIC ranges from 17 to 45 W for the two-core and four-core mobile parts and all the way to 95 W for a high-end desktop part. The IA cores and PG are powered from independent 0.65–1.15-V variable-voltage power supply sources, all controlled by the SVID bus. The DDR3 interface uses a 1.5-V power plane, while the PCIe interface uses a 1.05-V power plane. The die photograph is shown in Fig. 23.

REFERENCES

[1] “Intel® 64 and IA-32 Architectures Optimization Reference Manual,” [Online]. Available: http://www.intel.com/Assets/PDF/manual/248966.pdf

[2] E. Seevinck, F. J. List, and J. Lohstroh, “Static-noise margin analysis of MOS SRAM cells,” IEEE J. Solid-State Circuits, vol. SC-22, no. 5, pp. 748–754, Oct. 1987.

[3] S. Rusu et al., “A 45 nm 8-core enterprise Xeon processor,” in ISSCC Tech. Dig. Papers, 2009.


[4] E. Fayneh and E. Knoll, “Clock generation and distribution for Intel Banias mobile microprocessor,” in Proc. VLSI Circuits Symp., 2003, pp. 17–20.

[5] E. Rotem, A. Mendelson, R. Ginosar, and U. Weiser, “Multiple clock and voltage domains for chip multi processors,” in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO 42), New York, 2009, pp. 459–468.

[6] J. Shor, “Low noise linear voltage regulator for use as an on-chip PLL supply in microprocessors,” in Proc. IEEE Int. Symp. Circuits Syst., Paris, France, May 30, 2010, pp. 841–844.

[7] D. Duarte, G. Geannopoulos, U. Mughal, K. L. Wong, and G. Taylor, “Temperature sensor design in a high volume manufacturing 65 nm CMOS digital process,” in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2007, pp. 221–224.

[8] K. Luria and J. Shor, “Miniaturized CMOS thermal sensor array for temperature gradient measurement in microprocessors,” in Proc. IEEE Int. Symp. Circuits Syst., Paris, France, May 30, 2010, pp. 1855–1858.

Marcelo Yuffe received the B.Sc. degree in electrical engineering from the Technion–Israel Institute of Technology, Haifa, Israel, in 1991. He joined Intel Corporation, Haifa, Israel, in 1990, where he is a Senior Principal Engineer. He deals with special circuit design for CPUs, mainly I/O, clock, and power delivery circuits.

Moty Mehalel received the B.Sc. degree in electrical engineering from the Technion–Israel Institute of Technology, Haifa, Israel, in 1980. He is a Senior Principal Engineer with Intel Corporation, Haifa, Israel. He was with Tadiran Communication Ltd. from 1984 to 1988, focusing on DSP hardware development. He joined Intel in 1988 as a Design Engineer. Since then, he has been working on cache design, cache testing, techniques for lowering the minimum operational voltage, global circuit design methodologies, and low-power design.

Ernest Knoll received the B.S.E.E. degree from the Polytechnic University, Iassi, Romania, in 1980. He joined Intel Corporation, Haifa, Israel, in 1990 and has worked on several CPU generations, with a focus on clock generation and distribution. He is currently a Senior Principal Engineer. He holds 18 U.S. patents, all in the analog circuit design area, and has authored or coauthored five technical papers.

Joseph Shor (SM’11) received the B.A. degree in physics from Queens College, Queens, NY, in 1986, and the Ph.D. degree in electrical engineering from Columbia University, New York, NY, in 1993. From 1988 to 1994, he was a Senior Research Scientist with Kulite Semiconductor, where he developed processes and devices for silicon carbide and diamond microsensors. From 1994 to 1999, he was a Senior Analog Designer with Motorola Semiconductor in the DSP Division. Between 1999 and 2004, he was with Saifun Semiconductor as a Staff Engineer, where he established the analog activities for Flash and EEPROM NROM memories. Since 2004, he has been with Intel Corporation, Haifa, Israel, where he is presently a Principal Engineer and head of the Analog Team at Intel Yakum. He has authored or coauthored more than 50 papers in refereed journals and conference proceedings in the areas of analog circuit design and device physics. He holds 35 issued patents and several pending patents. His present interests include switching and linear voltage regulators, thermal sensors, PLLs, and I/O circuits, all for microprocessor applications.

Tsvika Kurts received the B.Sc. and M.Sc. degrees from the Technion–Israel Institute of Technology, Haifa, Israel, in 1984 and 1992, respectively. He is currently a Principal Engineer/Architect with Intel’s Microprocessor Chipset Division, Haifa, Israel, leading the debug architecture of the Sandy Bridge processor. He has been with Intel for 26 years. He was part of the core team that developed the Pentium M and the Centrino platform, and led the quad-core architecture of the Core 2 Duo. Earlier at Intel, he was part of the Intel Pentium Pro processor bus architecture and system validation team and was involved in the Intel Pentium 4 processor bus protocol development.

Michael Zelikson was born in Leningrad, Russia, in 1962. He received the B.Sc., M.Sc., and D.Sc. degrees from the Technion–Israel Institute of Technology, Haifa, Israel, in 1989, 1991, and 1995, respectively. His academic research focused on electrical modulation of optical constants in a-Si:H based waveguides. After completing his D.Sc. work, he joined the IBM Research Division, working in the field of analog and mixed-signal design in SiGe technology, including interconnect high-bandwidth modeling and linear amplifiers. Since 2003, he has been with Intel Corporation, Haifa, with his main fields of expertise in power delivery design and analysis, power management, and voltage regulation.

Eran Altshuler received the B.Sc. and M.Sc. degrees from the Technion–Israel Institute of Technology, Haifa, Israel, in 1987 and 1990, respectively, both in electrical engineering. In 1991, he joined Intel Corporation, Haifa, Israel, as a Digital Circuit Design Engineer in the processors group, where he is a Principal Engineer.

Eyal Fayneh received the B.Sc. degree from the University of Tel-Aviv, Tel Aviv, Israel, in 1991. He worked at Motorola designing RF circuits, frequency-generation circuits for radio applications, and clock-generation circuits, and in 1996 he joined Intel Corporation, Haifa, Israel, where he is a Principal Engineer. Currently, he designs high-performance clock generators for the CPU’s core and I/O.

Kosta Luria was born in Moscow, USSR, in 1962. He received the B.S. degree in electrical engineering from Tel Aviv University, Tel Aviv, Israel, in 1991. From 1991 to 1997, he was with Motorola Communications Ltd., developing analog circuits for wireless modems used in SCADA irrigation applications. In 1997, he joined the analog team at the startup company Friendly Robotics, which developed a robotic lawn mower and a robotic vacuum cleaner. He joined Intel and began working on analog chip design in 2001. His interests include smart temperature sensors, high-quality voltage regulators, bandgap references, A/D converters, and special circuits for new applications.