1 Shubu Mukherjee, FACT Group
Cache Scrubbing in Microprocessors: Myth or Necessity?
Practical Experience Report
Shubu Mukherjee
Joel Emer, Tryggve Fossum, & Steven K. Reinhardt*
Fault Aware Computing Technology (FACT) Group
Massachusetts Microprocessor Design Center, Intel Corporation
10th IEEE International Symposium Pacific Rim Dependable Computing, French Polynesia, March 3-5, 2004
* Also, University of Michigan, Ann Arbor

Summary
SECDED ECC (single error correction, double error detection)
  – commonly used in on-chip caches
  – interleaving converts spatial multi-bit errors to multiple single bit errors
Scrubbing
  – periodically read cache blocks and correct all single bit errors
  – this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
Our conclusion: given a detected error target of 10 year MTTF
  – scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

Origin of Cosmic Rays
Cosmic rays come from deep space
[Figure: cosmic-ray particles (protons, p, and neutrons, n) cascading down to Earth's surface]

Impact of Neutron Strike on a Si Device
Secondary source of upsets: alpha particles from packaging
Strikes release electron-hole pairs that can be absorbed by the source and drain to alter the state of the device
[Figure: neutron strike on a transistor device, releasing + and - charges between source and drain]

Strike Changes State of a Single Bit
[Figure: a strike flips one stored bit]
Example solution
  – error correction codes (ECC) for single bit correction
  – overhead = 7 bits for 64 bits of data

Strike Changes State of Two Adjacent Bits: Spatial Double Bit Error
[Figure: a single strike flips two adjacent stored bits]
Example solution
  – SECDED ECC (single error correction, double error detection): 8 bits of code per 64 bits of data
  – interleaving for the more general case …

Interleaving bits
Interleaving converts a spatial multi-bit error into multiple single bit errors
[Figure: a row of physically adjacent bits in which neighbouring bits belong to different ECC words; X = covered with a single ECC code, + = covered with a different ECC code]
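
As a rough illustration of the interleaving idea, the sketch below assigns physically adjacent bits to ECC words round-robin and counts how many flipped bits land in each word. The function name `errors_per_word` and the round-robin layout are assumptions made for illustration; the slides do not specify the physical layout.

```python
from collections import Counter

def errors_per_word(first_bit, burst_len, ways):
    """Count flipped bits per ECC word for a burst of adjacent bit flips.

    Assumed (hypothetical) layout: physical bit i belongs to ECC word i % ways.
    """
    return Counter(bit % ways for bit in range(first_bit, first_bit + burst_len))

# A 3-bit spatial error with 4-way interleaving: each affected word sees one flip.
print(dict(errors_per_word(first_bit=10, burst_len=3, ways=4)))
# Without interleaving (ways=1) the same strike puts all 3 flips in one word.
print(dict(errors_per_word(first_bit=10, burst_len=3, ways=1)))
```

Under this layout, any burst no longer than the interleaving factor leaves at most one flipped bit per ECC word, which single error correction can repair.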

Two Separate Strikes on Different Bits: Temporal Double Bit Errors
SECDED ECC (single error correction, double error detection)
  – can detect the error, but cannot correct it, if errors accumulate
  – a single bit correctable error becomes a double bit detectable error
[Figure: the same word struck at cycle 100 and again at cycle 1,000,000]

Solutions for Temporal Double Bit Errors
Natural effects
  – whenever a processor reads a cache block, we can correct the single bit error
  – check for errors when cache blocks are replaced from the cache
More powerful ECC
  – SECDED ECC requires 8 bits per 64 bits
      – 7 bits for single bit correction
      – 8th bit for double bit detection
      – overhead = 13%
  – ECC with two bit correction requires 12 bits per 64 bits
      – overhead = 19%
Scrubbing
  – periodically read memory and correct all single bit errors
  – disallows accumulation of temporal double bit errors
  – standard technique in main memories (DRAMs)
  – our calculations (later) will assume the worst case for soft errors
      – cache blocks don't get scrubbed naturally
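
The check-bit counts quoted above (7, 8, and 12 bits per 64 data bits) agree with the sphere-packing lower bound, which the sketch below evaluates. This is only a bound on the number of check bits, not a code construction, and the helper name `min_check_bits` is hypothetical; a practical double-error-correcting code may need more bits than the bound suggests.

```python
from math import comb

def min_check_bits(data_bits, t):
    """Smallest r satisfying the sphere-packing bound for t-error correction.

    This is a lower bound on check bits; a real code may need more.
    """
    r = 1
    while True:
        n = data_bits + r  # total codeword length
        # All error patterns with 0..t flipped bits must map to distinct syndromes.
        patterns = sum(comb(n, i) for i in range(t + 1))
        if 2 ** r >= patterns:
            return r
        r += 1

sec = min_check_bits(64, 1)   # single error correction
secded = sec + 1              # one extra parity bit for double error detection
dec = min_check_bits(64, 2)   # double error correction
for name, r in [("SEC", sec), ("SECDED", secded), ("DEC", dec)]:
    print(f"{name}: {r} check bits, overhead {r / 64:.1%}")
```

The computed overheads (8/64 and 12/64) round to the 13% and 19% shown on the slide.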

Memory Hierarchy of a Processor
Do we need to scrub on-chip caches?
  – depends on the size of these caches
[Figure: memory hierarchy: CPU, L1 cache (kilobytes), L2 cache (megabytes), main memory (gigabytes)]

Detected Unrecoverable Error (DUE)
Interval-based
  – MTTF = Mean Time to Failure
  – e.g., goal = 10 years MTTF for application crash
  – Bossen, IRPS 2002
Rate-based
  – FIT = Failure in Time = 1 failure in a billion hours
  – 10 year MTTF = $10^9 / (24 \times 365 \times 10)$ FIT = 11,415 FIT
[Figure: hypothetical example of FIT rates adding across structures: Cache 62 FIT + IQ 100 FIT + FU 58 FIT, labelled on the slide as a total of 210 FIT]
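
The conversion between the interval-based and rate-based views is a single division, sketched below. The function names are illustrative, and the 365-day year follows the slide's own arithmetic.

```python
HOURS_PER_YEAR = 24 * 365  # the slide uses a 365-day year

def mttf_years_to_fit(mttf_years):
    """Failures per billion hours corresponding to a given MTTF in years."""
    return 1e9 / (mttf_years * HOURS_PER_YEAR)

def fit_to_mttf_years(fit):
    """MTTF in years corresponding to a given FIT rate."""
    return 1e9 / (fit * HOURS_PER_YEAR)

print(f"10 year MTTF budget: {mttf_years_to_fit(10):.1f} FIT")
print(f"210 FIT corresponds to an MTTF of {fit_to_mttf_years(210):.0f} years")
```

Because FIT rates are additive across structures while MTTFs are not, the rate-based form is the convenient one for budgeting per-structure contributions.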

MTTF calculations: probabilities
1 quadword = 64 bits of data + 8 bits of SECDED ECC = 72 bits
Q = number of quadwords in cache memory
$P_d[n]$ = probability that a sequence of n strikes causes n - 1 single bit errors, followed by a double bit error on the n-th strike
$P_d[1] = 0$
$P_d[2] = 1/Q$
  – first strike: probability = $Q/Q$
  – second strike: probability = $1/Q$
  – $P_d[2] = (Q/Q) \cdot (1/Q) = 1/Q$

MTTF calculations: probabilities
$P_d[3] = \frac{Q-1}{Q} \cdot \frac{2}{Q}$
  – first strike: probability = $Q/Q$
  – second strike: probability = $(Q-1)/Q$ (it must land on a quadword not yet struck)
  – third strike: probability = $2/Q$ (it must land on one of the two quadwords already struck)
  – $P_d[3] = \frac{Q}{Q} \cdot \frac{Q-1}{Q} \cdot \frac{2}{Q}$

MTTF calculations: probabilities
$P_d[1] = 0$
$P_d[2] = \frac{1}{Q}$
$P_d[3] = \frac{Q-1}{Q} \cdot \frac{2}{Q}$
$P_d[4] = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{3}{Q}$
…
$P_d[n] = \frac{Q-1}{Q} \cdot \frac{Q-2}{Q} \cdot \frac{Q-3}{Q} \cdots \frac{Q-n+2}{Q} \cdot \frac{n-1}{Q}$
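
To check the general term, the following sketch evaluates $P_d[n]$ exactly with rational arithmetic for a toy cache. The value Q = 4 and the function name `p_d` are chosen only for illustration.

```python
from fractions import Fraction

def p_d(n, Q):
    """P_d[n]: the first n-1 strikes hit distinct quadwords and the n-th
    strike lands on one of the n-1 quadwords already struck."""
    if n < 2:
        return Fraction(0)
    prob = Fraction(n - 1, Q)      # n-th strike hits an already-struck quadword
    for k in range(1, n - 1):      # strikes 2..n-1 each avoid the k quadwords struck so far
        prob *= Fraction(Q - k, Q)
    return prob

Q = 4  # toy cache with four quadwords (illustrative value only)
dist = {n: p_d(n, Q) for n in range(1, Q + 2)}
for n, p in dist.items():
    print(f"P_d[{n}] = {p}")
print("total =", sum(dist.values()))
```

With Q quadwords a double bit error must occur by strike Q + 1 at the latest, so the terms for n = 1 … Q + 1 form a complete probability distribution.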

MTTF calculations: Equation
M = mean number of single bit errors to get a double bit error
  = expected value of a random variable with $P_d[n]$ as its probability distribution function
  – M can be easily generated using a computer program
  – MTTF (double bit error) = M × MTTF (single bit error)
For a 32 megabyte cache and FIT/bit = 0.001 [Normand 1996, Tosaka 1996]
  – MTTF (double bit error) = M × MTTF (single bit error)
      = 2567 × (1 / Cache FIT)
      = $2567 \times 10^9 / (0.001 \times 2^{22} \times 72 \times 24 \times 365)$ years
      = 970 years
Saleh et al.'s 1990 closed form equation
  – MTTF (double bit error) = $\frac{1}{72 f} \sqrt{\frac{\pi}{2Q}}$ = 970 years, where f = FIT/bit
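
The slide notes that M is easy to generate with a program. A minimal sketch of one way to do so is given below, assuming 8-byte quadwords and a 365-day year; the function names are hypothetical, and this is not the authors' actual program.

```python
from math import pi, sqrt

HOURS_PER_YEAR = 24 * 365

def mean_strikes(Q):
    """M = sum over n of n * P_d[n], accumulated with a running product."""
    mean, distinct, n = 0.0, 1.0, 2   # distinct = P(first n-1 strikes hit distinct quadwords)
    while distinct > 1e-18:           # stop once the remaining tail is negligible
        mean += n * distinct * (n - 1) / Q
        distinct *= (Q - (n - 1)) / Q
        n += 1
    return mean

def mttf_double_bit_years(cache_bytes, fit_per_bit):
    Q = cache_bytes // 8                                   # 64-bit quadwords
    single_bit_mttf_hours = 1e9 / (fit_per_bit * Q * 72)   # 72 bits per protected quadword
    return mean_strikes(Q) * single_bit_mttf_hours / HOURS_PER_YEAR

def closed_form_years(cache_bytes, fit_per_bit):
    Q = cache_bytes // 8
    rate_per_hour = 72 * fit_per_bit * 1e-9                # strike rate of one quadword
    return sqrt(pi / (2 * Q)) / rate_per_hour / HOURS_PER_YEAR

cache = 32 * 2**20  # 32 megabytes
print(f"M = {mean_strikes(cache // 8):.1f}")
print(f"MTTF from M:           {mttf_double_bit_years(cache, 0.001):.0f} years")
print(f"MTTF from closed form: {closed_form_years(cache, 0.001):.0f} years")
```

Both routes should land close to the 970 years quoted on the slide, because M is well approximated by $\sqrt{\pi Q / 2}$ when Q is large.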

Temporal Double Bit MTTF variations with cache size
[Figure: MTTF in years (log scale, 10 to 10,000) versus FIT/bit (0.001 to 0.01) for 4 MB, 16 MB, 64 MB, and 256 MB caches]
FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
  – higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
Temporal double bit error has a very small contribution to the DUE rate
  – compared to a goal of 10 years DUE MTTF
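
The trend in the plot can be approximated from the closed form quoted earlier in the deck. The sketch below sweeps the four cache sizes at both ends of the FIT/bit range; it assumes binary megabytes and 8-byte quadwords, and it is an approximation rather than the data behind the original figure.

```python
from math import pi, sqrt

HOURS_PER_YEAR = 24 * 365

def no_scrub_mttf_years(cache_mib, fit_per_bit):
    """Closed-form temporal double bit MTTF for a cache of cache_mib megabytes."""
    Q = cache_mib * 2**20 // 8                 # number of 64-bit quadwords
    rate_per_hour = 72 * fit_per_bit * 1e-9    # per-quadword strike rate
    return sqrt(pi / (2 * Q)) / rate_per_hour / HOURS_PER_YEAR

for size in (4, 16, 64, 256):
    worst = no_scrub_mttf_years(size, 0.01)
    best = no_scrub_mttf_years(size, 0.001)
    print(f"{size:>3} MB: {worst:6.0f} to {best:6.0f} years")
```

The MTTF falls as the inverse square root of the cache size and as the inverse of FIT/bit, which is why even the 256 MB case stays above the 10 year goal across this range.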

MTTF with Scrubbing
I = scrubbing interval; scrub at the end of each interval I
N = number of scrubbing intervals to reach MTTF
  = expected value of a random variable with probability distribution function $(1 - pf)^N \cdot pf$, where pf = probability of a temporal double bit error at the end of an interval
Assuming a 16 GB cache, FIT/bit = 0.001 (Normand 1996, Tosaka 1996), and scrubbing once a year (I = 1 year)
  – MTTF (double bit error) = N × I = 2281 × 1 = 2281 years
  – Saleh et al. 1990 closed form equation: $2 / [Q \cdot I \cdot (72 f)^2]$ = 2341 years, where f = FIT/bit
[Figure: timeline divided into consecutive scrubbing intervals of length I]
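
The closed-form figure can be reproduced with a few lines, as sketched below. The second function estimates pf under a Poisson strike model, which is an assumption of this sketch and not something stated on the slide; the authors' own computation of N = 2281 is not shown in the deck and is not reproduced here.

```python
from math import exp, expm1, log1p

HOURS_PER_YEAR = 24 * 365

def scrub_mttf_closed_form(Q, interval_years, fit_per_bit):
    """2 / (Q * I * (72 f)^2), with the rate converted to strikes per year."""
    rate = 72 * fit_per_bit * 1e-9 * HOURS_PER_YEAR
    return 2 / (Q * interval_years * rate**2)

def scrub_mttf_poisson(Q, interval_years, fit_per_bit):
    """I / pf, with pf taken from an assumed Poisson strike model."""
    # Mean number of strikes on one quadword during one scrubbing interval.
    mu = 72 * fit_per_bit * 1e-9 * HOURS_PER_YEAR * interval_years
    p_two = -expm1(-mu) - mu * exp(-mu)   # P(a given quadword takes >= 2 strikes)
    pf = -expm1(Q * log1p(-p_two))        # P(at least one such quadword among Q)
    return interval_years / pf

Q = 16 * 2**30 // 8  # quadwords in a 16 gigabyte cache
print(f"closed form:   {scrub_mttf_closed_form(Q, 1, 0.001):.0f} years")
print(f"Poisson model: {scrub_mttf_poisson(Q, 1, 0.001):.0f} years")
```

Halving the scrubbing interval doubles the MTTF in the closed form, since pf scales with the square of the interval while the number of intervals per year scales only linearly.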

Impact of Scrubbing on Temporal Double Bit MTTF
FIT/bit = 0.001 – 0.01 (Normand 1996, Tosaka 1996)
  – higher at higher altitudes (e.g., 3-5x at 1.5 km in Denver)
For 16 gigabytes of cache, scrubbing can help
  – compared to a DUE MTTF goal of 10 years
[Figure: MTTF in years (log scale, 1 to 1,000,000) versus FIT/bit (0.001 to 0.01) for a 16 gigabyte cache: scrub once a day, scrub once a month, scrub once a year, and no scrubbing]
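
A rough comparison of the four curves can be made with the two closed forms quoted in the deck, as in the sketch below. It assumes binary gigabytes, 8-byte quadwords, and a month of 1/12 year, so the numbers approximate the plot rather than reproduce its data.

```python
from math import pi, sqrt

HOURS_PER_YEAR = 24 * 365
Q = 16 * 2**30 // 8  # quadwords in a 16 gigabyte cache

def no_scrub_years(fit_per_bit):
    rate = 72 * fit_per_bit * 1e-9 * HOURS_PER_YEAR   # strikes per quadword per year
    return sqrt(pi / (2 * Q)) / rate

def scrub_years(fit_per_bit, interval_years):
    rate = 72 * fit_per_bit * 1e-9 * HOURS_PER_YEAR
    return 2 / (Q * interval_years * rate**2)

intervals = {"day": 1 / 365, "month": 1 / 12, "year": 1.0}
for f in (0.001, 0.01):
    row = ", ".join(f"{name}: {scrub_years(f, i):,.0f}" for name, i in intervals.items())
    print(f"FIT/bit {f}: no scrubbing {no_scrub_years(f):.1f} years; scrub every {row}")
```

At the high end of the FIT/bit range the unscrubbed estimate drops below the 10 year goal, while scrubbing once a year keeps it above, which matches the slide's observation that scrubbing can help at this cache size.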

Summary
SECDED ECC (single error correction, double error detection)
  – commonly used in on-chip caches
  – interleaving converts spatial multi-bit errors to multiple single bit errors
Scrubbing
  – periodically read cache blocks and correct all single bit errors
  – this prevents single bit errors from accumulating, thereby avoiding temporal double bit errors
Our conclusion: given a detected error target of 10 year MTTF
  – scrubbing is necessary only for very large caches (e.g., 100s of megabytes to gigabytes)

BACKUPS

Raw soft error rate: 0.001 – 0.010 FIT/bit
  – Y. Tosaka, S. Satoh, K. Suzuki, T. Sugii, H. Ehara, G. A. Woffinden, and S. A. Wender, “Impact of Cosmic Ray Neutron Induced Soft Errors on Advanced Submicron CMOS Circuits,” Symposium on VLSI Technology, Digest of Technical Papers, 1996.
  – Normand, “Single Event Upset at Ground Level,” IEEE Transactions on Nuclear Science, Vol. 43, No. 6, December 1996.