IEEE TRANSACTIONS ON COMPUTERS, VOL. C-24, NO. 5, MAY 1975
Orthogonal Latin Square Configuration for
LSI Memory Yield and Reliability Enhancement
MU Y. HSIAO, FELLOW, IEEE, AND DOUGLAS C. BOSSEN, MEMBER, IEEE
Abstract-When errors occur which exceed the correction capability of an error correcting code, the only recourse to restore the original memory function is to physically replace the failed entity. In this paper the authors propose an automatic reconfiguration technique which uses the concept of address skewing to disperse such multiple errors into correctable errors. No additional redundancy other than that required for the error correcting code is needed. The skewing mechanism is derived using the theory of orthogonal Latin squares.
Index Terms-Error correction, fault-tolerant large-scale integrated (LSI) memory, Latin square.
I. INTRODUCTION
THE PROGRESS of large-scale integrated (LSI) technology has greatly increased the density of semiconductor memory chips [1]. For example, in FET memory, a density of 8k bits/chip has been reported in a laboratory [2]. However, because of the complicated batch fabrication process, the yield for mass-produced high-density chips is still not high. Furthermore, some undiscovered failures or imperfect quality control during the manufacturing process may also cause reliability problems after the chip is used in a computer.
Many techniques have been suggested for solving either the yield problem or the reliability problem, but a single practical solution for both problems has not previously been found. For example, it has been suggested to use a multiple error correcting code to solve both problems; i.e., part of the error correction capability can be used for yield enhancement and the other part can be used for reliability enhancement. However, the use of an error correcting code for yield enhancement has a greater impact on cost and performance, and usually becomes unacceptable in a high-speed computer system.
In this paper, a new approach is used to solve both the yield and the reliability problem with minimum cost and performance impact on a computer system.
Manuscript received August 3, 1974; revised November 14, 1974. The authors are with the IBM Corporation, Poughkeepsie, N. Y. 12602.
II. MEMORY SYSTEM ORGANIZATION AND CORRESPONDING MATHEMATICAL STRUCTURE
In a conventional random access memory with built-in single error correction and double error detection (SEC-DED) capability, such as in IBM 370 systems, the basic memory organization is as shown in Fig. 1.
The RAM chips are packaged in terms of bit organization, i.e., one bit or multiple bits per card. For example, if a memory has 32k addresses, then the 1 bit/card organization means the card would contain 32k bits with each bit in a different address. However, these 32k bits are supplied by one power source and one decoder. Any failure (single or multiple) on the card will cause only one bad bit in the addressed word. Therefore, a Hamming SEC-DED code will be able to correct the card failure. This is the current state of the art. In this paper, we try to extend the error correction capability of the SEC-DED code (or any other code) for yield enhancement by introducing some mathematical structure into the memory organization. More specifically, the structure introduced is the orthogonal Latin square [3], [4]. The concept of using orthogonal Latin squares for error correction is quite different from what we stated previously [5]. The basic definition of a Latin square is reviewed here.
Definition: A Latin square of order (size) m is an m × m square array of the digits 0, 1, ..., m - 1, with each row and column a permutation of the digits 0, 1, ..., m - 1. Two Latin squares are orthogonal if, when one Latin square is superimposed on the other, every ordered pair of elements appears only once.
Examples: For m = 5, there exist four possible orthogonal Latin squares:
L1 =
0 1 2 3 4
1 2 3 4 0
2 3 4 0 1
3 4 0 1 2
4 0 1 2 3
L2 =
0 1 2 3 4
2 3 4 0 1
4 0 1 2 3
1 2 3 4 0
3 4 0 1 2
L3 =
0 1 2 3 4
3 4 0 1 2
1 2 3 4 0
4 0 1 2 3
2 3 4 0 1
L4 =
0 1 2 3 4
4 0 1 2 3
3 4 0 1 2
2 3 4 0 1
1 2 3 4 0
Fig. 1. Basic memory organization. (Block labels recoverable from the figure: address, decoder, output buffer, error correction unit, to C.P.U.)
T1 =
0 1 2 3
1 0 3 2
2 3 0 1
3 2 1 0
T2 =
0 2 3 1
1 3 2 0
2 0 1 3
3 1 0 2
T3 =
0 3 1 2
1 2 0 3
2 1 3 0
3 0 2 1
Fig. 2. The set of three orthogonal Latin squares of order 4. (Rows: address; columns: card.)
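The definition and the m = 5 examples can be checked mechanically. The sketch below is illustrative and not taken from the paper: it assumes the standard construction for prime m, in which square k has entry (k·i + j) mod m in row i and column j, and the helper names `latin_square`, `is_latin`, and `are_orthogonal` are hypothetical. For m = 5 and k = 1, ..., 4 this construction yields the four squares L1 to L4 listed above.

```python
from itertools import combinations, product


def latin_square(m, k):
    """Square with entry (k*i + j) mod m in row i, column j (assumes m prime, 1 <= k < m)."""
    return [[(k * i + j) % m for j in range(m)] for i in range(m)]


def is_latin(square):
    """True if every row and every column is a permutation of 0..m-1."""
    m = len(square)
    digits = set(range(m))
    rows_ok = all(set(row) == digits for row in square)
    cols_ok = all({square[i][j] for i in range(m)} == digits for j in range(m))
    return rows_ok and cols_ok


def are_orthogonal(a, b):
    """True if superimposing a on b produces every ordered pair exactly once."""
    m = len(a)
    pairs = {(a[i][j], b[i][j]) for i, j in product(range(m), repeat=2)}
    return len(pairs) == m * m


squares = [latin_square(5, k) for k in range(1, 5)]  # L1 .. L4
all_latin = all(is_latin(sq) for sq in squares)
all_orthogonal = all(are_orthogonal(a, b) for a, b in combinations(squares, 2))
print(all_latin, all_orthogonal)
```

The orthogonality test simply counts distinct superimposed pairs: an m × m superposition has m² cells, so every ordered pair appears exactly once precisely when all m² pairs are distinct.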
TABLE I
Conventional Form     Galois Field Element Representation (a)
0 = 00                0 = 0   = 00
1 = 01                1 = α^0 = 10
2 = 10                2 = α^1 = 01
3 = 11                3 = α^2 = 11
(a) α: the root of a primitive polynomial x^2 + x + 1.
Next, let us show how the orthogonal Latin square structure can be built into a memory organization. This is best illustrated through the following example.
Example: The simplest example is to consider a memory having four 1-bit-organized memory cards, each card having four address positions.
Let us consider the three copies of the 4 × 4 orthogonal Latin squares as shown in Fig. 2. The vertical coordinate is used as the address position and the horizontal coordinate is used as the card position. Inside each card, the word address number is not coded in the conventional binary form, but rather in the form of a Galois field element [6] as shown in Table I.
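The Galois field coding of Table I can be reproduced with a few lines of code. This is a minimal sketch, not the authors' implementation: it assumes each element c0 + c1·α is written as the bit pair "c0c1" (which is what makes α^0 read as 10 in Table I), and the function names `alpha_step` and `galois_table` are hypothetical.

```python
def alpha_step(bits):
    """Multiply c0 + c1*alpha by alpha, using alpha^2 = alpha + 1 over GF(2)."""
    c0, c1 = bits
    return (c1, c0 ^ c1)


def galois_table():
    """Map the labels 0..3 to two-bit strings 'c0c1' in the style of Table I."""
    table = {0: "00"}          # the zero element
    bits = (1, 0)              # alpha^0 = 1
    for label in (1, 2, 3):    # labels assigned to alpha^0, alpha^1, alpha^2
        table[label] = f"{bits[0]}{bits[1]}"
        bits = alpha_step(bits)
    return table


table = galois_table()
for label in range(4):
    # label, conventional binary form, Galois field element form
    print(label, format(label, "02b"), table[label])
```

Multiplying by α a third time returns to α^0, so the three nonzero codes 10, 01, 11 form a single cycle.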
This new representation requires no change on the input address line. The three copies of the orthogonal Latin squares are in fact orthogonal to the following original address square T0, i.e.,
0 0 0 0
1 1 1 1
2 2 2 2
3 3 3 3
This T0 square represents the original address distribution condition. T1, T2, and T3 describe possible address skewing patterns after a double error has occurred. The binary representation of the address in terms of Galois field elements enables the generation of T1, T2, and T3 by a linear feedback shift register as shown in Fig. 3. This is very practical from a hardware implementation point of view. The generation of T0, T1, T2, and T3 follows directly from the state transition condition S0, S1, S2, S3 of each card. It should be noticed that all memory cards are of the same structure and hence require only a single part number. The different address skew patterns T1, T2, and T3 can be achieved by setting the different S1 values on cards 2, 3, and 4 only once. For example, after S1 = 10 is gated into the two shift-register cells and becomes the initial state of g(x) on card 2, it is added modulo 2 to any incoming address. As shown in Fig. 3, e.g., the address 00 ⊕ 10 = 10 becomes address 1 = 10 at time T1 on card 2; similarly for other addresses. S2 and S3 can be obtained by merely shifting the contents of the shift registers on cards 2, 3, and 4. The state transition diagram of the shift register is also included in Fig. 3.
One way of operating the memory system is described as follows:
1) At time T0, the original copy of the address and bit configuration is shown in the T0 square, with each g(x) register containing 00 as its initial state. This is exactly the same as a conventional memory without any reconfiguration capability.
2) If a double error is detected by the ECC system, the S1 pattern is loaded into each g(x) shift register. This means that the memory system is now in the T1 address configuration.
3) For the first double error, there is no need to test the memory words to check whether the double error has been separated, because the orthogonal Latin square structure guarantees the separation of a double error in an address into two single errors at two different addresses.
4) If, after operating for a period of time, a third error occurs in a word which already has a single error in it, then a shifting pulse is sent into the shifting line. This pulse changes every S1 state to S2, and a new copy of the orthogonal Latin square, T2, is obtained. In general, it is highly unlikely that the result of separating this new double error pattern will create another new double error at a different word address. However, it may be a good strategy to conduct a diagnostic test to make sure that no double errors exist in any other address.
5) If a new double error is detected in a new address, we can send one more shifting pulse and switch the address pattern of the memory system to T3. In this example, three errors can always be corrected. This last switching guarantees the separation of any possible double errors in a single address because it is within the capability of this specific Latin square example.
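The skewing mechanism and the operating steps above can be sketched in software. The code below is an illustrative model, not the hardware of Fig. 3: it assumes the Table I coding with each element written as the bit pair (c0, c1), per-card S1 values of 00, 10, 01, 11 for cards 1 to 4 (the values that reproduce Fig. 2), and hypothetical names such as `skew_square` and `first_good_configuration`. A bad cell is modelled as a pair (card index, address inside the card).

```python
from itertools import combinations

GF = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}   # Table I: label -> (c0, c1)
LABEL = {bits: label for label, bits in GF.items()}
S1_PER_CARD = [(0, 0), (1, 0), (0, 1), (1, 1)]       # assumed S1 on cards 1..4


def shift(state):
    """One shifting pulse of the g(x) = x^2 + x + 1 register (multiply by alpha)."""
    c0, c1 = state
    return (c1, c0 ^ c1)


def card_state(card, k):
    """Register contents S_k on a card; S0 is 00 on every card."""
    if k == 0:
        return (0, 0)
    state = S1_PER_CARD[card]
    for _ in range(k - 1):
        state = shift(state)
    return state


def skew_square(k):
    """T_k: row = incoming address, column = card, entry = address used inside the card."""
    return [[LABEL[tuple(a ^ s for a, s in zip(GF[addr], card_state(card, k)))]
             for card in range(4)]
            for addr in range(4)]


def worst_word(k, bad_cells):
    """Largest number of bad bits that any single word sees under T_k."""
    square = skew_square(k)
    return max(sum((card, square[addr][card]) in bad_cells for card in range(4))
               for addr in range(4))


def first_good_configuration(bad_cells):
    """Smallest k in 0..3 for which every word has at most one bad bit, else None."""
    for k in range(4):
        if worst_word(k, bad_cells) <= 1:
            return k
    return None


# Double error: cards 1 and 3 (indices 0 and 2) both bad at internal address 2.
double_error = {(0, 2), (2, 2)}
print(worst_word(0, double_error), worst_word(1, double_error))

# Exhaustive check of the claim that any three bad cells can be made correctable.
all_cells = [(card, addr) for card in range(4) for addr in range(4)]
triples_ok = all(first_good_configuration(set(t)) is not None
                 for t in combinations(all_cells, 3))
print(triples_ok)
```

In this model two bad cells on different cards fall into the same word under exactly one of the four configurations, which is why three bad cells (at most three colliding pairs) always leave at least one usable configuration. The search in `first_good_configuration` tries the configurations in order, which only loosely mirrors the step-by-step shifting procedure described in the text.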
In general, as will be discussed below, a large orthogonal Latin square set can correct many errors and has a high probability of success in a few tries.
The existence and construction of high-order orthogonal Latin squares in terms of Galois field elements is well known. We also have written an APL program to generate orthogonal Latin squares. For a memory of $2^r$ addresses, there exist $2^r - 1$ copies of orthogonal Latin squares, which is more than necessary for practical applications. The generation of these $2^r - 1$ copies of orthogonal Latin squares can always be achieved by an r-stage linear feedback shift register characterized by a primitive polynomial of degree r.
III. STATISTICAL APPROACH FOR ESTIMATING ERROR CORRECTION CAPABILITY
It is important to know the maximum number or the range of errors that can always be corrected by the existence of $(2^r - 1) + 1$ copies of orthogonal squares in conjunction with the implementation of a SEC-DED ECC system. It is rather difficult to give a rigorous general proof. However, a simulation program has been written to evaluate the error correction capability based on a random error distribution. This program simply assumes a memory organization with a Latin square configuration. Because of the decoder organization, the Latin square configuration is not performed at the final single address level as described in this paper, but at the level where a block of addresses is reconfigured. This actually has the effect of reducing the reconfiguration capability since the number of Latin squares is reduced. This results from performing the skewing function on a subset of the total number of address bits. In the specific simulation to be discussed, Latin squares of order 8 were considered. This resulted from performing the skewing function on just the three high-order address bits on each memory card. The justification for this restricted capability comes from weighing control register complexity against the expected number of failures over the system life.
For this simulation, 500 failures were assumed to occur within a population of 1000 8-megabyte memories over a five-year period. The results show that 66 percent of the failures causing multiple errors were reconfigurable with the Latin square skewing into single bit errors.
In general, the question "Given an orthogonal Latin square capability in combination with an error correction code, what is the maximum error correcting capability?" still remains to be answered. However, the following theorem can be easily proven from the orthogonal property of orthogonal Latin squares.
Theorem: Given a memory consisting of $2^r$ words each of length n bits where $2^r \ge n$, then any multiple error of k bits in a single word where $1 < k \le n$ can be dispersed into k single bit errors occurring in k different words using orthogonal Latin squares of order $2^r$.
Fig. 4 illustrates the idea of the proof by assuming that all bits in word 4 are bad. In the skewed form using an orthogonal Latin square, each word now has a single error.
Fig. 3. Memory organization showing Latin square configuration. (Recoverable content: four cards, each with a decoder and a two-cell g(x) shift register driven by a common shifting pulse. The register states S0, S1, S2, S3 are 00, 00, 00, 00 on card 1; 00, 10, 01, 11 on card 2; 00, 01, 11, 10 on card 3; and 00, 11, 10, 01 on card 4. T0 is the original copy; T1, T2, and T3 are copies of orthogonal Latin squares.)
Fig. 4. Skewing pattern produced by a Latin square. (Panels: original address form; skewed to Latin square form.)
DISCUSSION
For a given memory of k address lines producing $2^k$ different addresses, we can always form $2^k - 1$ copies of orthogonal Latin square configurations by using a linear feedback shift register with a degree-k primitive polynomial as described earlier. Since $2^k - 1$ is usually a large number, e.g., for k = 16, 32, etc., we always have enough copies of orthogonal Latin squares. Therefore, it is important to use all possible copies for both yield and reliability enhancement.
REFERENCES
[1] G. C. Feth, "Memories are bigger, faster, and cheaper," IEEE Spectrum, vol. 10, pp. 28-35, Nov. 1973.
[2] W. K. Huffman and H. L. Kalter, "An 8-k bit random access memory chip using a one device FET cell," in Proc. 1973 IEEE Int. Solid-State Circuits Conf., Philadelphia, Pa., Feb. 14, 1973, pp. 64-65.
[3] H. B. Mann, Analysis and Design of Experiments. New York: Dover, 1949.
[4] R. C. Bose, "On the application of the properties of Galois fields to the problem of construction of hyper-Graeco-Latin squares," Sankhya, vol. 3, pt. 4, p. 323, 1938.
[5] M. Y. Hsiao, D. C. Bossen, and R. T. Chien, "Orthogonal Latin square codes," IBM J. Res. Develop., vol. 14, no. 4, July 1970.
[6] W. W. Peterson and E. J. Weldon, Error Correcting Codes. Cambridge, Mass.: M.I.T. Press, 1972.
Mu Y. Hsiao (S'60-M'61-SM'71-F'73) received the B.S.E.E. degree from Taiwan University, Taipei, Taiwan, in 1956, the M.S. degree in mathematics from the University of Illinois, Urbana, in 1960, and the Ph.D. degree in electrical engineering from the University of Florida, Gainesville, in 1967.
He has been working with the IBM Corporation since 1960 (excepting 1965-1967). He is presently a Senior Engineer and Manager of the Reliability and Technology Analysis Department with the IBM Corporation, Poughkeepsie, N. Y. He wrote the first modern digital computer book in the Chinese language, which was published in 1964, 1966, 1968, and 1970 by the Chinese Institute of Electrical Engineering. He is co-author of Error Detecting Logic for Digital Computers (New York: McGraw-Hill, 1968). He has written 27 papers and holds 12 U. S. patents in the fields of error-correcting codes, switching theory, and logic-checking techniques.
Dr. Hsiao is a member of Sigma Xi, Eta Kappa Nu, Phi Kappa Phi, and AAAS. He is also listed in the American Men of Science. He has also received six IBM Patent Achievement Awards and one Outstanding Invention Award.
Douglas C. Bossen (S'66-M'68) was born in Akron, Ohio, on September 26, 1941. He received the B.S.E.E. degree from Northwestern University, Evanston, Ill., in 1964, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University in 1966 and 1968, respectively.
In 1968 he joined the IBM Corporation at the Systems Development Laboratory, Poughkeepsie, N. Y., working in the Advanced Reliability Department. He has written papers in the areas of error correcting codes, sequential machines, and test pattern generation.
Dr. Bossen is a member of Sigma Xi, Tau Beta Pi, and Eta Kappa Nu.
Original address form (Add # 4 marked with brackets):
 0   0   0   0   0   0   0   0
 1   1   1   1   1   1   1   1
 2   2   2   2   2   2   2   2
 3   3   3   3   3   3   3   3
[4] [4] [4] [4] [4] [4] [4] [4]
 5   5   5   5   5   5   5   5
 6   6   6   6   6   6   6   6
 7   7   7   7   7   7   7   7
Skewed address form (skewed Add # 4 marked with brackets):
 0   3  [4]  2   1   5   7   6
[4]  7   0   6   5   1   3   2
 2   1   6   0   3   7   5  [4]
 1   2   5   3   0  [4]  6   7
 5   6   1   7  [4]  0   2   3
 7  [4]  3   5   6   2   0   1
 6   5   2  [4]  7   3   1   0
 3   0   7   1   2   6  [4]  5
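The dispersal shown in Fig. 4 can be verified directly from the skewed array. The sketch below is illustrative only: the `SKEWED` array is transcribed from the scanned figure, so its entries are an assumption about what the original figure showed, and the helper names are hypothetical. Rows are taken to be words and columns to be bit positions, with each entry giving the internal address that supplies that bit.

```python
# Skewed address form of Fig. 4, as transcribed from the scanned figure (assumption).
SKEWED = [
    [0, 3, 4, 2, 1, 5, 7, 6],
    [4, 7, 0, 6, 5, 1, 3, 2],
    [2, 1, 6, 0, 3, 7, 5, 4],
    [1, 2, 5, 3, 0, 4, 6, 7],
    [5, 6, 1, 7, 4, 0, 2, 3],
    [7, 4, 3, 5, 6, 2, 0, 1],
    [6, 5, 2, 4, 7, 3, 1, 0],
    [3, 0, 7, 1, 2, 6, 4, 5],
]

# Original address form: every bit of word i comes from internal address i.
ORIGINAL = [[addr] * 8 for addr in range(8)]


def errors_per_word(square, bad_address):
    """For each word (row), count the bit positions served by the failed address."""
    return [row.count(bad_address) for row in square]


def bad_bit_positions(square, bad_address):
    """Bit position (column) of the first bad bit in each word."""
    return [row.index(bad_address) for row in square]


print(errors_per_word(ORIGINAL, 4))   # all eight bad bits sit in word 4
print(errors_per_word(SKEWED, 4))     # one bad bit per word after skewing
print(bad_bit_positions(SKEWED, 4))   # each bad bit lands in a different column
```

Because the skewed array is a Latin square, address 4 occurs exactly once in every row and every column, so an 8-bit failure in one word becomes eight single-bit errors that a SEC-DED code can correct word by word.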