
IEEE TRANSACTIONS ON COMPUTERS, VOL. C-24, NO. 5, MAY 1975


Orthogonal Latin Square Configuration for LSI Memory Yield and Reliability Enhancement

MU Y. HSIAO, FELLOW, IEEE, AND DOUGLAS C. BOSSEN, MEMBER, IEEE

Abstract-When errors occur which exceed the correction capability of an error correcting code, the only recourse to restore the original memory function is to physically replace the failed entity. In this paper the authors propose an automatic reconfiguration technique which uses the concept of address skewing to disperse such multiple errors into correctable errors. No additional redundancy other than that required for the error correcting code is needed. The skewing mechanism is derived using the theory of orthogonal Latin squares.

Index Terms-Error correction, fault-tolerant large-scale integrated (LSI) memory, Latin square.

I. INTRODUCTION

The progress of large-scale integrated (LSI) technology has greatly increased the density of semiconductor memory chips [1]. For example, in FET memory, a density of 8k bits/chip has been reported in a laboratory [2]. However, because of the complicated batch fabrication process, the yield for mass production of high-density chips is still not high. Furthermore, some undiscovered failures or imperfect quality control during the manufacturing process may also cause reliability problems after the chip is used in a computer.

Many techniques have been suggested for solving either the yield problem or the reliability problem, but a single practical solution for both problems has not previously been found. For example, it has been suggested to use a multiple error correcting code to solve both problems; i.e., part of the error correction capability can be used for yield enhancement and the other part can be used for reliability enhancement. However, the use of an error correcting code for yield enhancement produces greater impact on cost and performance, and usually becomes unacceptable in a high-speed computer system.

In this paper, a new approach is used to solve both the yield and the reliability problem with minimum cost and performance impact on a computer system.

Manuscript received August 3, 1974; revised November 14, 1974. The authors are with the IBM Corporation, Poughkeepsie, N. Y. 12602.

II. MEMORY SYSTEM ORGANIZATION AND CORRESPONDING MATHEMATICAL STRUCTURE

In a conventional random access memory with built-in single error correction and double error detection (SEC-DED) capability, such as in IBM 370 systems, the basic memory organization can be shown as in Fig. 1.

The RAM chips are packaged in terms of bit organization, i.e., one bit or multiple bits per card. For example, if a memory has 32k addresses, then the 1 bit/card organization means the card would contain 32k bits with each bit in a different address. However, these 32k bits are supplied by one power source and one decoder. Any failure (single or multiple) on the card will cause only one bad bit in the addressed word. Therefore, a Hamming SEC-DED code will be able to correct the card failure. This is the current state of the art. In this paper, we try to extend the error correction capability of the SEC-DED code (or any other code) for yield enhancement by introducing some mathematical structure into the memory organization. More specifically, the structure introduced is the orthogonal Latin square [3], [4]. The concept of using orthogonal Latin squares for error correction here is quite different from what we stated previously [5]. The basic definition of a Latin square is reviewed here.

Definition: A Latin square of order (size) $m$ is an $m \times m$ square array of the digits $0, 1, \ldots, m - 1$, with each row and column a permutation of the digits $0, 1, \ldots, m - 1$. Two Latin squares are orthogonal if, when one Latin square is superimposed on the other, every ordered pair of elements appears only once.

Examples: For $m = 5$, there exist four possible orthogonal Latin squares:

$$
L_1 = \begin{bmatrix} 0&1&2&3&4\\ 1&2&3&4&0\\ 2&3&4&0&1\\ 3&4&0&1&2\\ 4&0&1&2&3 \end{bmatrix}, \qquad
L_2 = \begin{bmatrix} 0&1&2&3&4\\ 2&3&4&0&1\\ 4&0&1&2&3\\ 1&2&3&4&0\\ 3&4&0&1&2 \end{bmatrix},
$$

$$
L_3 = \begin{bmatrix} 0&1&2&3&4\\ 3&4&0&1&2\\ 1&2&3&4&0\\ 4&0&1&2&3\\ 2&3&4&0&1 \end{bmatrix}, \qquad
L_4 = \begin{bmatrix} 0&1&2&3&4\\ 4&0&1&2&3\\ 3&4&0&1&2\\ 2&3&4&0&1\\ 1&2&3&4&0 \end{bmatrix}.
$$

[Fig. 1. Basic memory organization. Labels recoverable from the drawing: address, decoder, output buffer, error correction unit, to C.P.U.]

              T1          T2          T3
           0 1 2 3     0 2 3 1     0 3 1 2
Address    1 0 3 2     1 3 2 0     1 2 0 3
           2 3 0 1     2 0 1 3     2 1 3 0
           3 2 1 0     3 1 0 2     3 0 2 1
             Card        Card        Card

Fig. 2. The set of three orthogonal Latin squares of order 4.
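The definition and the order-5 examples can be checked mechanically. The following is a minimal sketch, assuming the cyclic rule that entry (i, j) of the k-th square equals (k·i + j) mod m; the paper does not state this formula, but it reproduces the four squares listed above for m = 5. The function names are illustrative only.

```python
from itertools import combinations

def cyclic_square(k, m):
    """Square whose entry in row i, column j is (k*i + j) mod m (assumed construction)."""
    return [[(k * i + j) % m for j in range(m)] for i in range(m)]

def is_latin(square):
    """True if every row and every column is a permutation of 0..m-1."""
    m = len(square)
    digits = set(range(m))
    rows_ok = all(set(row) == digits for row in square)
    cols_ok = all({square[i][j] for i in range(m)} == digits for j in range(m))
    return rows_ok and cols_ok

def are_orthogonal(a, b):
    """True if superimposing a on b yields every ordered pair exactly once."""
    m = len(a)
    pairs = {(a[i][j], b[i][j]) for i in range(m) for j in range(m)}
    return len(pairs) == m * m

# The four squares L1..L4 of order 5, keyed by their multiplier k.
squares = {k: cyclic_square(k, 5) for k in range(1, 5)}
all_latin = all(is_latin(sq) for sq in squares.values())
all_orthogonal = all(are_orthogonal(squares[p], squares[q])
                     for p, q in combinations(squares, 2))
```

Both `all_latin` and `all_orthogonal` come out true for m = 5. The cyclic rule is only guaranteed to work for prime m; for orders such as 4 or 8 the Galois field construction described later in the paper is needed.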

TABLE I

Conventional Form     Galois Field Element Representation (a)
0 = 00                0 = 0   = 00
1 = 01                1 = α^0 = 10
2 = 10                2 = α^1 = 01
3 = 11                3 = α^2 = 11

(a) α: the root of the primitive polynomial x^2 + x + 1.

Next, let us show how the orthogonal Latin square structure can be built into a memory organization. This is best illustrated through the following example.

Example: The simplest example is to consider a memory having four 1-bit-organized memory cards, each card having four address positions.

Let us consider the three copies of the 4 X 4 orthogonal Latin squares as shown in Fig. 2. The vertical coordinate is used as the address position and the horizontal position is used as the card position. Inside each card, the word address number is not coded in the conventional binary form, but rather in the form of a Galois field element [6] as shown in Table I.

This new representation requires no change on the input address line. The three copies of the orthogonal Latin squares actually are orthogonal to the following original address square $T_0$, i.e.,

$$
T_0 = \begin{bmatrix} 0&0&0&0\\ 1&1&1&1\\ 2&2&2&2\\ 3&3&3&3 \end{bmatrix}.
$$

This $T_0$ square represents the original address distribution condition. $T_1$, $T_2$, and $T_3$ describe possible address skewing patterns after a double error has occurred. The binary representation of the address in terms of Galois field elements enables the generation of $T_1$, $T_2$, and $T_3$ by a linear feedback shift register as shown in Fig. 3. This is very practical from a hardware implementation point of view. The generation of $T_0$, $T_1$, $T_2$, and $T_3$ follows directly from the state transition condition $S_0$, $S_1$, $S_2$, $S_3$ of each card. It should be noticed that all memory cards are of the same structure and hence require only a single part number. The different address skew patterns $T_1$, $T_2$, and $T_3$ can be achieved by setting the different $S_1$ on cards 2, 3, and 4 only once. For example, after $S_1 = 10$ is gated into the two shift-register cells and becomes the initial state of $g(x)$ on card 2, the register performs modulo-2 addition on any incoming address. As shown in Fig. 3, e.g., the address $00 \oplus 10 = 10$ becomes address 1 = 10 at time $T_1$ on card 2; similarly for other addresses. $S_2$ and $S_3$ can be obtained by merely shifting the content of the shift register on cards 2, 3, and 4. The state transition diagram of the shift register is also included in Fig. 3.

One way of operating the memory system is described as follows:

1) At time $T_0$, the original copy of the address and bit configuration is shown in the $T_0$ square, with each $g(x)$ register containing 00 as its initial state. This is exactly the same as a conventional memory without any reconfiguration capability.

2) If a double error is detected by the ECC system, the $S_1$ pattern is loaded into each $g(x)$ shift register. This means that the memory system is now in the $T_1$ address configuration.

3) For the first double error, there is no need to test the memory words to check whether the double error has been separated, because the orthogonal Latin square structure guarantees the separation of a double error in an address into two single errors at two different addresses.

4) If, after operating for a period of time, a third error appears in a word which already has a single error in it, then a shifting pulse is sent into the shifting line. This pulse changes all $S_1$ states to $S_2$, and a new copy of the orthogonal Latin square, $T_2$, is obtained. In general, it is highly unlikely that the result of separating this new double error pattern will create another new double error at a different word address. However, it may be a good strategy to conduct a diagnostic test to make sure that no double errors exist in any other address.

5) If a new double error is detected in a new address, we can send one more shifting pulse and switch the address pattern of the memory system to $T_3$. In this example, three errors can always be corrected. This last switching guarantees the separation of any possible double errors in a single address because it is within the capability of this specific Latin square example.
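The register states and the resulting skew patterns of this four-card example can be traced in a few lines. The sketch below is an illustration under stated assumptions, not the authors' hardware: it takes the two-bit codes of Table I as (coefficient of 1, coefficient of α), models one shifting pulse as multiplication by α modulo x^2 + x + 1, and uses hypothetical names (`shift`, `skew_square`, `words_hit`). Cards are indexed from 0, so index 1 is card 2 of the text.

```python
# Table I: address label -> Galois field code (coefficient of 1, coefficient of alpha).
LABEL_TO_CODE = {0: (0, 0), 1: (1, 0), 2: (0, 1), 3: (1, 1)}
CODE_TO_LABEL = {code: label for label, code in LABEL_TO_CODE.items()}

def shift(state):
    """One shifting pulse: multiply the register contents by alpha mod x^2 + x + 1."""
    c0, c1 = state
    return (c1, c0 ^ c1)

def xor(a, b):
    """Modulo-2 addition of two 2-bit codes."""
    return (a[0] ^ b[0], a[1] ^ b[1])

def card_states():
    """states[t][card] is the register content of each card in configuration T_t."""
    s1 = [LABEL_TO_CODE[card] for card in range(4)]   # 00, 10, 01, 11 on cards 1..4
    states = [[(0, 0)] * 4, s1]                      # T0 (all zero), then T1
    for _ in range(2):                               # two pulses give T2 and T3
        states.append([shift(s) for s in states[-1]])
    return states

def skew_square(states_t):
    """Row = incoming address, column = card, entry = address actually used on that card."""
    return [[CODE_TO_LABEL[xor(LABEL_TO_CODE[a], s)] for s in states_t]
            for a in range(4)]

T = [skew_square(s) for s in card_states()]

def words_hit(t, bad_cells):
    """Incoming address that reaches each bad (card, on-card address) cell under T_t."""
    return [next(a for a in range(4) if T[t][a][card] == phys)
            for card, phys in bad_cells]

# Two bad cells at the same on-card address: a double error in one word under T0.
bad = [(0, 2), (2, 2)]
before = words_hit(0, bad)
after = words_hit(1, bad)
```

Under these assumptions `T[1]`, `T[2]`, and `T[3]` match the three squares of Fig. 2, `before` places both bad cells in the same word, and `after` places them in two different words, which is the separation described in step 3.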

In general, as will be discussed below, a large-size orthogonal Latin square set can correct many errors and has a high probability of success in a few tries.

The existence and construction of high-order orthogonal Latin squares in terms of Galois field elements is well known. We also have written an APL program to generate orthogonal Latin squares. For a memory of $2^r$ addresses, there exist $2^r - 1$ copies of orthogonal Latin squares, which is more than necessary for practical applications. The generation of these $2^r - 1$ copies of orthogonal Latin squares can always be achieved by an $r$-stage linear feedback shift register characterized by a primitive polynomial of degree $r$.

III. STATISTICAL APPROACH FOR ESTIMATING ERROR CORRECTION CAPABILITY

It is important to know the maximum number or the range of errors that can always be corrected by the existence of $(2^r - 1) + 1$ copies of orthogonal squares in conjunction with the implementation of a SEC-DED ECC system. It is rather difficult to give a rigorous general proof. However, a simulation program has been written to evaluate the error correction capability based on a random error distribution. This program simply assumes a memory organization with a Latin square configuration. Because of the decoder organization, the Latin square configuration is not performed at the final single address level as described in this paper, but at the level where a block of addresses is reconfigured. This actually has the effect of reducing the reconfiguration capability, since the number of Latin squares is reduced. This results from performing the skewing function on a subset of the total number of address bits. In the specific simulation to be discussed, Latin squares of order 8 were considered. This resulted from performing the skewing function on just the three high-order address bits on each memory card. The justification for this restricted capability comes from weighing control register complexity against the expected number of failures over the system life.

For this simulation, 500 failures were assumed to occur within a population of 1000 8-megabyte memories over a five-year period. The results show that 66 percent of the failures causing multiple errors were reconfigurable with the Latin square skewing into single bit errors.

In general, the question of "Given an orthogonal Latin square capability in combination with an error correction code, what is the maximum error correcting capability?" still remains to be answered. However, the following theorem can be easily proven from the orthogonal property of orthogonal Latin squares.

Theorem: Given a memory consisting of $2^r$ words each of length $n$ bits where $2^r \ge n$, then any multiple error of $k$ bits in a single word where $1 < k \le n$ can be dispersed into $k$ single bit errors occurring in $k$ different words using orthogonal Latin squares of order $2^r$.

Fig. 4 illustrates the idea of the proof by assuming that all bits in word 4 are bad. In the skewed form using an orthogonal Latin square, each word now has a single error.
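The theorem and the order-8 case used in the simulation can be illustrated with a short sketch. It rests on several assumptions that are not in the paper: the primitive polynomial is taken to be x^3 + x + 1, card c is assumed to be loaded initially with the bit pattern of c, and addresses are treated as raw 3-bit patterns rather than relabeled as in Table I. All names are hypothetical.

```python
from itertools import combinations

R = 3                      # number of skewed address bits, giving squares of order 2**R
ORDER = 1 << R
POLY = 0b1011              # x^3 + x + 1, an assumed primitive polynomial of degree 3

def shift(state):
    """One LFSR step: multiply the register contents by alpha modulo POLY."""
    state <<= 1
    if state & ORDER:      # a degree-R term appeared, so reduce it
        state ^= POLY
    return state

def skew_squares():
    """Return ORDER - 1 squares; squares[t][a][c] is the address card c uses for input a."""
    states = list(range(ORDER))          # assumed initial loading: card c holds pattern c
    squares = []
    for _ in range(ORDER - 1):
        squares.append([[a ^ s for s in states] for a in range(ORDER)])
        states = [shift(s) for s in states]
    return squares

def are_orthogonal(p, q):
    """True if superimposing p on q yields every ordered pair exactly once."""
    pairs = {(p[a][c], q[a][c]) for a in range(ORDER) for c in range(ORDER)}
    return len(pairs) == ORDER * ORDER

def errors_per_word(square, bad_address):
    """Bad bits seen by each input address when every card is bad at bad_address."""
    return [sum(1 for c in range(ORDER) if square[a][c] == bad_address)
            for a in range(ORDER)]

SQUARES = skew_squares()
T0 = [[a] * ORDER for a in range(ORDER)]   # unskewed configuration

mutually_orthogonal = all(are_orthogonal(p, q) for p, q in combinations(SQUARES, 2))
orthogonal_to_t0 = all(are_orthogonal(T0, sq) for sq in SQUARES)
unskewed = errors_per_word(T0, 4)          # all 8 bad bits land in word 4
skewed = [errors_per_word(sq, 4) for sq in SQUARES]
```

With these choices the seven generated squares are mutually orthogonal and orthogonal to the unskewed square, and in every skewed configuration the eight bad bits of word 4 fall into eight different words, one per word, as the theorem states for k = n = 8.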

[Fig. 3. Memory organization showing Latin square configuration. The drawing shows the address lines feeding four memory cards (card 1 to card 4), each with its own decoder and a two-cell shift register g(x) driven by a common shifting pulse, together with the state transition diagram of the register. T0 is the original copy; T1, T2, and T3 are copies of orthogonal Latin squares. Register states per card:

Card 1: S0 = 00, S1 = 00, S2 = 00, S3 = 00
Card 2: S0 = 00, S1 = 10, S2 = 01, S3 = 11
Card 3: S0 = 00, S1 = 01, S2 = 11, S3 = 10
Card 4: S0 = 00, S1 = 11, S2 = 10, S3 = 01]

[Fig. 4. Skewing pattern produced by a Latin square. Panels: original address form, skewed to Latin square form, in which each word has a single error.]

DISCUSSION

For a given memory of $k$ address lines to produce $2^k$ different addresses, we can always form $2^k - 1$ copies of orthogonal Latin square configurations by using a degree-$k$ primitive polynomial linear feedback shift register as described earlier. Since $2^k - 1$ is usually a large number, e.g., for $k$ = 16, 32, etc., we always have enough copies of orthogonal Latin squares. Therefore, it is important to use all possible copies for both yield and reliability enhancement.

REFERENCES

[1] G. C. Feth, "Memories are bigger, faster, and cheaper," IEEE Spectrum, vol. 10, pp. 28-35, Nov. 1973.

[2] W. K. Huffman and H. L. Kalter, "An 8-k bit random access memory chip using a one device FET cell," in Proc. 1973 IEEE Int. Solid-State Circuit Conf., Philadelphia, Pa., Feb. 14, 1973, pp. 64-65.

[3] H. B. Mann, Analysis and Design of Experiments. New York: Dover, 1949.

[4] R. C. Bose, "On the application of the properties of Galois fields to the problem of construction of hyper-Graeco-Latin squares," Sankhya, vol. 3, pt. 4, p. 323, 1938.

[5] M. Y. Hsiao, D. C. Bossen, and R. T. Chien, "Orthogonal Latin square codes," IBM J. Res. Develop., vol. 14, no. 4, July 1970.

[6] W. W. Peterson and E. J. Weldon, Error Correcting Codes. Cambridge, Mass.: M.I.T. Press, 1972.

Mu Y. Hsiao (S'60-M'61-SM'71-F'73) received the B.S.E.E. degree from Taiwan University, Taipei, Taiwan, in 1956, the M.S. degree in mathematics from the University of Illinois, Urbana, in 1960, and the Ph.D. degree in electrical engineering from the University of Florida, Gainesville, in 1967.

He has been working with the IBM Corporation since 1960 (excepting 1965-1967). He is presently a Senior Engineer and Manager of the Reliability and Technology Analysis Department with the IBM Corporation, Poughkeepsie, N. Y. He wrote the first modern digital computer book in the Chinese language, which was published in 1964, 1966, 1968, and 1970 by the Chinese Institute of Electrical Engineering. He is co-author of Error Detecting Logic for Digital Computers (New York: McGraw-Hill, 1968). He has written 27 papers and holds 12 U. S. patents in the fields of error-correcting codes, switching theory, and logic-checking techniques.

Dr. Hsiao is a member of Sigma Xi, Eta Kappa Nu, Phi Kappa Phi, and AAAS. He is also listed in the American Men of Science. He has also received six IBM Patent Achievement Awards and one Outstanding Invention Award.

Douglas C. Bossen (S'66-M'68) was born in Akron, Ohio, on September 26, 1941. He received the B.S.E.E. degree from Northwestern University, Evanston, Ill., in 1964, and the M.S. and Ph.D. degrees in electrical engineering from Northwestern University in 1966 and 1968, respectively.

In 1968 he joined the IBM Corporation at the Systems Development Laboratory, Poughkeepsie, N. Y., working in the Advanced Reliability Department. He has written papers in the areas of error correcting codes, sequential machines, and test pattern generation.

Dr. Bossen is a member of Sigma Xi, Tau Beta Pi, and Eta Kappa Nu.

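Fig. 4 array contents. Rows are words and columns are bit positions; bracketed entries mark the bits of address 4.

Original address form (Add # 4 marked):

 0   0   0   0   0   0   0   0
 1   1   1   1   1   1   1   1
 2   2   2   2   2   2   2   2
 3   3   3   3   3   3   3   3
[4] [4] [4] [4] [4] [4] [4] [4]
 5   5   5   5   5   5   5   5
 6   6   6   6   6   6   6   6
 7   7   7   7   7   7   7   7

Latin square form (skewed Add # 4 marked):

 0   3  [4]  2   1   5   7   6
[4]  7   0   6   5   1   3   2
 2   1   6   0   3   7   5  [4]
 1   2   5   3   0  [4]  6   7
 5   6   1   7  [4]  0   2   3
 7  [4]  3   5   6   2   0   1
 6   5   2  [4]  7   3   1   0
 3   0   7   1   2   6  [4]  5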
