

A Self-Tuning Dynamic Voltage Scaled Processor Using Delay-Error Detection and Correction
Shidhartha Das, David Roberts, Seokwoo Lee, Sanjay Pant, David Blaauw, Todd Austin, Trevor Mudge, Krisztian Flautner*
Department of EECS, University of Michigan, Ann Arbor, MI
*ARM Inc., Cambridge, UK

Abstract
In this paper, we present the implementation and silicon measurement results of a 64bit processor fabricated in 0.18µm technology. The processor employs a delay-error detection and correction scheme called Razor to eliminate voltage safety margins and scale voltage 120mV below the first failure point. It achieves 44% energy savings over the worst case operating conditions for a 0.1% targeted error rate at a fixed frequency of 120MHz.

1. Introduction
Recently, we proposed a new voltage management concept for Dynamic Voltage Scaled (DVS) processors, called Razor [1], along with initial simulation based results. In this paper, we present the first silicon implementation of a Razor design. We discuss the circuit structures used in this new implementation and present silicon measurements for 33 tested dies. The chip implements a subset of the Alpha instruction set and was fabricated with MOSIS [7] in TSMC 0.18µm technology.

Traditional DVS techniques [2-6] use a delay chain or a lookup table to determine the minimum voltage necessary for error-free operation at a particular frequency. Hence they require voltage margins to ensure correct operation over process variation, temperature fluctuations and voltage drop. In contrast, the Razor processor uses a delay-error tolerant flip-flop on critical paths to detect when voltage is scaled to the point of first failure for a given frequency. Voltage control is based on the observed error rate and power savings are achieved by 1) eliminating the above margins under nominal operating and silicon conditions and 2) scaling voltage 120mV below the first failure point to achieve a 0.1% targeted error rate. The total measured energy savings over the worst case was 44% at 120MHz under nominal conditions.

Figure 1a shows the delay-error tolerant Razor flip-flop in concept. The standard positive edge triggered DFF is augmented with a shadow latch which samples at the negative clock edge. Timing errors are detected by comparing the main flip-flop data with that of the shadow latch. An additional detector flags the occurrence of metastability at the main flip-flop output. Error signals of individual RFFs are OR-ed together to generate the pipeline restore signal which overwrites the shadow latch data into the errant flip-flop. A distributed pipeline recovery mechanism [1] is implemented to recover correct pipeline state (figure 1b). The minimum allowed supply voltage is set to ensure that the shadow latch data is guaranteed correct and can be used for error recovery. The duration of the positive clock phase, when the shadow latch is transparent, determines the sampling delay of the shadow latch. The hold time constraint imposed by the shadow latch was met by inserting delay buffers.

Figure 1a. Abstract View of the Razor Flip-Flop
Figure 1b. Distributed Pipeline Recovery Mechanism

The rest of the paper is organized as follows. Section 2 describes the transistor level design of the Razor flip-flop and the metastability detector. Section 3 describes Razor processor implementation details and measurement results are presented in Section 4. The Razor energy savings are quantified in Section 5. We give the details on Razor voltage control in Section 6 and draw our conclusions in Section 7.

2. Razor Flip-Flop Circuit Level Implementation
Figure 2 shows the transistor level schematic of the RFF. The error comparator evaluates in the negative phase when the data latched by the slave differs from the shadow. The metastability detector, which shares the dynamic node Err_dyn with the comparator, evaluates in the positive phase of the clock when the slave output could become metastable. Thus, the RFF error signal is flagged when either evaluates. This, in turn, evaluates the dynamic gate to generate the restore signal by OR-ing error signals of individual RFFs. The restore overwrites the master with the shadow latch data such that the slave gets the correct data at the next positive edge. In this positive phase, it also disables the shadow to protect state. The Rbar_latched signal precharges the Err_dyn node for the next errant cycle.

Compared to a regular DFF of the same drive strength and delay, the RFF consumes 22% extra (60fJ/49fJ) energy when sampled data is static and 65% extra (205fJ/124fJ) energy when sampled data switches. However, in the processor only 207 flip-flops out of 2388 flip-flops, or 9%, could become critical and needed to be RFFs. Hence, the net Razor power overhead, including the delay buffer power for short paths, was computed to be 3% of nominal chip power.
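The per-flip-flop overhead figures quoted for the RFF follow directly from the stated energies. The short Python check below reproduces them; the function and variable names are illustrative only and do not come from the paper.

```python
def percent_extra(rff_fj: float, dff_fj: float) -> float:
    """Extra energy of an RFF relative to a standard DFF, in percent."""
    return (rff_fj / dff_fj - 1.0) * 100.0


# Energies in fJ as quoted in the text (RFF versus regular DFF).
static_overhead = percent_extra(60, 49)       # sampled data is static
switching_overhead = percent_extra(205, 124)  # sampled data switches

# Share of flip-flops that had to be Razor flip-flops.
rff_share = 207 / 2388 * 100.0

print(f"static overhead:    {static_overhead:.1f}%")
print(f"switching overhead: {switching_overhead:.1f}%")
print(f"RFF share:          {rff_share:.1f}%")
```

Rounded to whole percentages, these give the 22%, 65% and 9% stated above. The 3% chip-level overhead also includes delay buffer power and cannot be rederived from these numbers alone.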
1-4244-0098-8/06/$20.00 ©2006 IEEE. ICICDT06
IEEE International Conference on Integrated Circuit Design and Technology. May, 2006. Padova, Italy.
Figure 2a. Razor Flip-Flop Circuit Schematic
Figure 2b. Restore Generation Circuitry
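The detect-and-restore behavior of the RFF can be illustrated with a cycle-level model. The sketch below is an abstraction written for this purpose, not the transistor-level circuit of Figure 2: it assumes the main flip-flop captures data at the rising edge and the shadow latch captures the settled data at the falling edge, and it ignores metastability detection. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RazorFlipFlop:
    """Cycle-level abstraction of one RFF (assumed model, not the real circuit)."""
    main: int = 0    # value captured by the main flip-flop at the rising edge
    shadow: int = 0  # value captured by the shadow latch at the falling edge

    def clock(self, d_at_rising_edge: int, d_at_falling_edge: int) -> bool:
        """Sample both storage elements for one cycle and return the error flag."""
        self.main = d_at_rising_edge
        self.shadow = d_at_falling_edge
        return self.main != self.shadow

    def restore(self) -> None:
        """Overwrite the main flip-flop with the shadow latch data."""
        self.main = self.shadow


def pipeline_restore(error_flags) -> bool:
    """The restore signal is the OR of the individual RFF error signals."""
    return any(error_flags)


# Example: the second flip-flop sees late-arriving data (0 at the rising edge,
# 1 once the path has settled), so it flags an error and is restored.
flops = [RazorFlipFlop(), RazorFlipFlop()]
errors = [flops[0].clock(1, 1), flops[1].clock(0, 1)]
if pipeline_restore(errors):
    for flop, err in zip(flops, errors):
        if err:
            flop.restore()
```

In this model only the errant flip-flop is rewritten, mirroring the description that the restore signal overwrites the shadow latch data into the errant flip-flop.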
The metastability detector consists of p- and n-skewed inverters which switch to opposite power rails under a metastable input voltage. The detector evaluates when input node QS can be ambiguously interpreted by its fan-out, inverter G1 and the error comparator. The DC transfer curves (figure 3) of inverter G1, the error comparator and the metastability detector show that the ambiguously interpreted voltage band is contained well within the "detection" band. Table 1 gives the error detection and ambiguous interpretation bands for different corners.

Figure 3. DC Transfer Characteristics (x-axis: voltage of node QS; curves for inverter G1, the metastability detector and the error comparator, with the detection band and the ambiguous band marked)

Corner           Ambiguous Band (V)   Detection Band (V)
Slow 1.2V 85C    0.57-0.60            0.53-0.64
Typ. 1.2V 40C    0.52-0.58            0.48-0.61
Fast 1.2V 27C    0.48-0.56            0.40-0.61
Slow 1.8V 85C    0.77-0.87            0.67-0.93
Fast 1.8V 27C    0.64-0.81            0.58-0.89
Table 1. Metastability Detector Corner Analysis

The probability that metastability propagates through the error detection logic and causes metastability of the restore signal itself was computed to be below 2e-30. Such an event is flagged by the fail signal generated using double skewed flip-flops. In the rare event of a fail, the pipeline is flushed and the supply voltage is immediately increased. During 4 months of chip testing, this event was never detected.

3. Processor Implementation Details
The die photograph of the Razor processor and its details are shown in figure 4. To verify correct operation, the dcache/register file contents were scanned and compared with a PC emulating the same code. A 64b register records the number of errant cycles and was sampled to compute the error rate. An internal clock unit generates an asymmetric clock ranging from 60MHz to 400MHz in steps of 20MHz. The shadow latch sampling delay, defined by the positive clock phase, is configurable from 0ps to 3.5ns in steps of 500ps. The clock unit has a separate voltage domain that is not voltage scaled. Energy savings from Razor DVS were measured at 140 and 120MHz for 33 chips from 2 fabrication runs.

Figure 4. Die Photograph of the Chip
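The error rate is obtained by sampling the 64b errant-cycle register. A minimal sketch of that computation is given below, together with the clock unit's configuration points as listed in Section 3. How the register is read and how elapsed cycles are counted are not described in the paper, so the function signature, the wraparound handling and the example numbers are assumptions made for illustration.

```python
COUNTER_BITS = 64  # width of the errant-cycle register described in the text


def error_rate(count_start: int, count_end: int, elapsed_cycles: int) -> float:
    """Fraction of cycles flagged as errant between two counter samples."""
    if elapsed_cycles <= 0:
        raise ValueError("elapsed_cycles must be positive")
    # Modular difference tolerates a counter wraparound (assumed behavior).
    errant = (count_end - count_start) % (1 << COUNTER_BITS)
    return errant / elapsed_cycles


# Configuration points of the internal clock unit, as stated in the text.
CLOCK_FREQS_MHZ = list(range(60, 401, 20))      # 60MHz to 400MHz in 20MHz steps
SAMPLING_DELAYS_PS = list(range(0, 3501, 500))  # 0ps to 3.5ns in 500ps steps

# Hypothetical example: 120 errant cycles over 120,000 cycles
# (1 ms of execution at 120MHz) corresponds to a 0.1% error rate.
rate = error_rate(1_000, 1_120, 120_000)
```

The two ranges give 18 selectable clock frequencies and 8 selectable sampling delays.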
Technology Node                          0.18µm
Max. Clock Frequency                     140MHz
DVS Supply Voltage Range                 1.2-1.8V
Total Number of Transistors              1.58 million
Die Size                                 3.3mm*3.6mm
Measured Chip Power at 1.8V              130mW
Icache Size                              8KB
Dcache Size                              8KB
Total Number of Flip-Flops               2801
Total Number of Razor Flip-Flops         207
Number of Delay Buffers Added            2388
Error Free Operation (Simulation Results)
Standard FF Energy (Static/Switching)    49fJ/124fJ
RFF Energy (Static/Switching)            60fJ/205fJ
% Total Chip Power Overhead              2.9%
Error Correction and Recovery Overhead
Energy of a RFF per error event          260fJ
Table 2. Processor Implementation Details

4. Measurement Results
Figure 5 shows the error rates and energy gains versus supply voltage at 120 and 140MHz for two chips. Energy at a particular voltage is normalized with respect to the energy at point of first failure. For all plotted points, correct program execution with Razor error correction was verified.

From the figure, we note that the error rate at the point of first failure is very low because only a few of the critical paths fail to meet the setup requirements. As voltage is scaled further into the sub-critical regime the error rate increases exponentially. The IPC penalty due to the error recovery cycles is negligible for error rates below 0.1%. Under such low error rates, the recovery overhead energy is also negligible and the total processor energy shows a quadratic reduction with the supply voltage. At error rates exceeding 0.1%, the recovery energy rapidly starts to dominate, offsetting the quadratic savings due to voltage scaling. For the measured chips, the energy optimal error rate fell at approximately 0.1%.

Table 3 shows the measured power at the point of first failure and the energy per instruction for both the chips at the point of first failure and at the point of 0.1% error rate. At 120MHz, chip 1 consumes 104.5mW at the first failure point and 89.7mW at an optimal 0.1% error rate, leading to 15% energy savings with negligible IPC hit. The energy gains for chip 2 are 18%. These gains are in addition to the energy saved by eliminating voltage margins.

120MHz, 27C   Point of First Failure              Point of 0.1% Error Rate
              Power      Energy per Instruction   Power      Energy per Instruction
                         (Power/IPC/Freq)                    (Power/IPC/Freq)
Chip 1        104.5mW    870pJ                    89.7mW     740pJ
Chip 2        119.4mW    990pJ                    99.6mW     830pJ
Table 3. Error Rate and Energy/Instruction at Point of First Failure and Point of 0.1% Error Rate

Figure 6 shows the distribution of the first failure voltage for the 33 measured chips. The first failure voltages for chips 1 and 2 in figure 5 are 1.63V and 1.72V, respectively, and hence represent typical and worst case process conditions. The scatter plot shows the correlation between the first failure voltage and the 0.1% error rate voltage. The relative "flatness" of the linear fit indicates less sensitivity to process variation when running at a 0.1% error rate than at point of first failure. The distribution of energy savings from running at 0.1% error rate at 120MHz and 27C is shown for all chips and ranges from 5% to 23%. The measured error rates at different operating temperatures show an 80mV shift in the point of first failure from 1.46V to 1.54V for a temperature increase from 45 to 95C as shown in figure 7.

5. Razor Energy Savings
The bar graph in figure 8 shows the energy for the two chips in figure 5 when operating at 120MHz and 45C. The first set of bars
shows the energy when Razor is turned off and worst-case margins are added to ensure correct operation. For chip 1, 80mV temperature margin, 130mV process margin (compared to the worst-case chip out of the 33 chips) and an estimated 180mV power supply margin (10% of nominal Vdd) were added. The second and third sets of bars show the energy when operating with Razor at first point of failure and at 0.1% error rate.

Total energy gains for chip 1 (71mW, 44%) and chip 2 (63mW, 39%) are comparable because greater process margin in chip 1 (100mV greater) is compensated by increased savings for chip 2 when scaling below first failure point. The distribution of the total energy savings over the worst case at 120 and 140MHz for all 33 chips is shown in figure 9.

Figure 5. Error Rate and Normalized Energy Measurements for Chips 1 and 2 (x-axis: voltage in volts; curves at 120MHz and 140MHz)
Figure 6. Distribution of Point of First Failure, Point of 0.1% Error Rate and Normalized Energy across 33 measured chips (120MHz, 27C; lots 0 and 1)
Figure 7. Temperature Dependence of Error Rate
Figure 8. Razor Energy Savings @120MHz (chip 1 / chip 2: measured power with supply, temperature and process margins 160.5mW / 162.8mW; power with Razor DVS at point of first failure 104.5mW / 119.4mW; power with Razor DVS at point of 0.1% error rate 89.7mW / 99.6mW)
Figure 9. Distribution of Net Energy Savings

6. Razor Voltage Control
Figure 10 shows the voltage controller, implemented in software, that regulates the supply voltage by reacting to error rates. The controller samples the error register and adjusts the supply voltage to achieve a targeted error rate. The response of the voltage controller for a 0.50% targeted error rate for a test code with alternating high and low error rate phases is shown. The controller settles at 1.52V at high error rate phases and at 1.45V at low error rate phases.

Figure 10. Razor Voltage Controller Response (x-axis: runtime samples)

7. Conclusion
In this paper, we present a self-tuning DVS processor using delay-error tolerant flip-flops. We obtained 44% energy savings by eliminating voltage margins and operating at a 0.1% error rate.

References
[1] D. Ernst, et al., pp. 7-18, MICRO 36.
[2] K. Suzuki, et al., pp. 587-590, CICC 97.
[3] S. Sakiyama, et al., pp. 99-100, Symposium on VLSI Circuits, 1997.
[4] T. Burd, et al., pp. 294-295, ISSCC 2000.
[5] S. Akui, et al., pp. 64-65, ISSCC 2004.
[6] Nowka, et al., pp. 340-341, ISSCC 2002.
[7] www.mosis.org
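The total savings quoted in Section 5 can be checked against the power values in Figure 8 and Table 3. The sketch below does this arithmetic; it assumes that, at a fixed 120MHz clock with negligible IPC loss, a ratio of powers equals a ratio of energies. The dictionary and function names are illustrative.

```python
# Power in mW at 120MHz, taken from Figure 8 and Table 3.
worst_case = {"chip1": 160.5, "chip2": 162.8}      # Razor off, all margins added
at_0p1_percent = {"chip1": 89.7, "chip2": 99.6}    # Razor DVS at 0.1% error rate


def savings(before_mw: float, after_mw: float):
    """Return (absolute saving in mW, relative saving in percent)."""
    saved = before_mw - after_mw
    return saved, 100.0 * saved / before_mw


results = {
    chip: savings(worst_case[chip], at_0p1_percent[chip])
    for chip in ("chip1", "chip2")
}
for chip, (mw, pct) in results.items():
    print(f"{chip}: {mw:.1f}mW saved, {pct:.1f}% of worst-case power")
```

Rounded, this reproduces the 71mW (44%) and 63mW (39%) total gains reported for chips 1 and 2.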
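The paper does not give the control law used by the software voltage controller of Section 6. As a rough illustration of error-rate-driven voltage control, the sketch below uses a simple fixed-step rule: raise the supply when the sampled error rate exceeds the target, lower it otherwise. The step size, the rule itself and the toy error model (zero errors above a first-failure voltage, exponential growth below it) are all assumptions and not the authors' implementation. Only the 1.2-1.8V supply range is taken from Table 2.

```python
V_MIN_MV, V_MAX_MV = 1200, 1800  # DVS supply range listed in Table 2


def next_voltage_mv(v_mv: int, measured_rate: float, target_rate: float,
                    step_mv: int = 10) -> int:
    """Assumed fixed-step rule: raise the supply on excess errors, else lower it."""
    if measured_rate > target_rate:
        v_mv += step_mv
    else:
        v_mv -= step_mv
    return max(V_MIN_MV, min(V_MAX_MV, v_mv))  # clamp to the supply range


def toy_error_rate(v_mv: int, first_failure_mv: int = 1630) -> float:
    """Hypothetical error model: no errors at or above first failure,
    exponentially increasing error rate below it."""
    if v_mv >= first_failure_mv:
        return 0.0
    return min(1.0, 1e-6 * 10 ** ((first_failure_mv - v_mv) / 40))


def run_controller(start_mv: int, target_rate: float, iterations: int) -> list:
    """Iterate the control loop and return the voltage trace in millivolts."""
    trace = [start_mv]
    for _ in range(iterations):
        rate = toy_error_rate(trace[-1])
        trace.append(next_voltage_mv(trace[-1], rate, target_rate))
    return trace


# Start at the nominal 1.8V supply and regulate toward a 0.1% error rate.
trace = run_controller(start_mv=1800, target_rate=0.001, iterations=100)
```

With this toy model the trace descends from 1.8V and then dithers within one step of the voltage at which the modeled error rate crosses the target. A real controller would likely need filtering of the sampled error rate and a bounded response time, which this sketch does not address.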