4. Information Redundancy
Reliable System Design 2010, by: Amir M. Rahmani
matlab1.ir
Information Redundancy
Code: a system for representing information
• e.g., Morse code
Code word: a collection of symbols or digits used to represent information according to the rules of a given code
Binary code: a code in which code words contain only symbols that are either 0 or 1
Error detection, error correction
Coding is often applied to:
• Information transfer: often serial communication through a channel
• Information storage
Start with a k-bit data word
Add r code bits to the k-bit data
Total = n-bit code word (n = k + r)
Not all 2^n combinations are valid code words
For certain encoding schemes, some types of errors can also be corrected
To extract the original data, the n bits must be decoded
Overhead = r/n
• e.g., for (single-bit) parity, the overhead is 1/n
• additional bits required
• time to encode and decode
Hamming distance (d)
Number of bits in which two words differ from each other; d(x,y) = Σ_k (x_k XOR y_k)
• E.g., 0010 and 1110 have a Hamming distance of 2
Rules:
• d(x,y) = 0 iff x = y
• d(x,y) = d(y,x)
• d(x,y) + d(y,z) >= d(x,z)
For a group of code words, d is the minimum Hamming distance over all possible pairs of code words
• E.g., {000, 011, 101, 110} has a Hamming distance of 2
d determines the code's ability to detect and/or correct errors
• up to d-1 bit errors can be detected
• up to ⌊(d-1)/2⌋ bit errors can be corrected
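The definitions above can be checked with a short Python sketch. The function names (hamming_distance, code_distance, capabilities) are illustrative and not part of the original slides; words are assumed to be equal-length bit strings.

```python
from itertools import combinations


def hamming_distance(x: str, y: str) -> int:
    """Number of bit positions in which two equal-length words differ."""
    if len(x) != len(y):
        raise ValueError("words must have the same length")
    return sum(a != b for a, b in zip(x, y))


def code_distance(code_words) -> int:
    """Minimum Hamming distance over all pairs of code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))


def capabilities(d: int):
    """Return (detectable bit errors, correctable bit errors) for distance d."""
    return d - 1, (d - 1) // 2


if __name__ == "__main__":
    print(hamming_distance("0010", "1110"))
    d = code_distance(["000", "011", "101", "110"])
    print(d, capabilities(d))
```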
Hamming distance (d)
Two words in this figure are connected by an edge if their d is 1
d = 2: can detect single-bit errors
Hamming distance (d)
The code {000, 111} can be used to encode a single data bit: 0 is encoded as 000 and 1 as 111. This code is identical to TMR.
d = 3: can detect single- and double-bit errors, can correct single-bit errors
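A minimal sketch of this repetition code, assuming that correction is done by majority vote as in TMR. The function names are hypothetical.

```python
def tmr_encode(bit: str) -> str:
    """Encode one data bit ('0' or '1') as three copies."""
    return bit * 3


def is_code_word(word: str) -> bool:
    """Detection only: any word other than 000 or 111 signals an error."""
    return word in ("000", "111")


def tmr_decode(word: str) -> str:
    """Correction by majority vote; right only if at most one bit flipped."""
    return "1" if word.count("1") >= 2 else "0"


if __name__ == "__main__":
    sent = tmr_encode("0")   # "000"
    received = "010"         # single-bit error
    print(is_code_word(received), tmr_decode(received))
```

Note that the same code cannot do both jobs at once: a double-bit error such as 000 becoming 011 is detected as a non-code word, but majority voting would miscorrect it to 1.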
Separability of a Code
A code is separable if it has separate fields for the data and the code bits.
• Decoding consists of disregarding the code bits
• The code bits can be processed separately to verify the correctness of the data
A non-separable code has the data and code bits integrated together - extracting the data from the encoded word requires some processing
Single-bit Parity
Simplest separable error detection code
• Adds one bit of redundancy to each data word
Encoding and decoding cost is low
Even (odd) parity: add a bit such that the total number of ones in the code word is even (odd)
• E.g., 001010 gets a parity bit of 0 for even parity (1 for odd)
Can detect all single-bit errors (and all errors in an odd number of bits)
• Hamming distance >= 2
• Could be greater than 2 if data words don't use all bit combinations
Drawbacks:
• Unable to detect even-bit errors, which can be common
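The parity rule can be sketched in a few lines of Python. The helper names parity_bit and check_parity are illustrative; data words are assumed to be bit strings with the parity bit appended at the end.

```python
def parity_bit(data: str, even: bool = True) -> str:
    """Parity bit that makes the total number of 1s even (or odd)."""
    bit = data.count("1") % 2      # 1 if the data holds an odd number of 1s
    if not even:
        bit ^= 1
    return str(bit)


def check_parity(word: str, even: bool = True) -> bool:
    """True if the code word (data + parity bit) has the expected parity."""
    return word.count("1") % 2 == (0 if even else 1)


if __name__ == "__main__":
    word = "001010" + parity_bit("001010")
    print(word, check_parity(word))
    print(check_parity("101010" + "0"))   # one flipped bit: detected
    print(check_parity("111010" + "0"))   # two flipped bits: not detected
```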
Single-bit Even Parity
Even or Odd Parity?
The decision depends on which type of all-bits failure (all-0's or all-1's) is more probable
For even parity - the parity bit for the all zeroes data word will be 0 and an all-0’s failure will go undetected - it is a valid code word
Selecting the odd parity code will allow the detection of the all-0's failure
If all-1's failure is more likely - the odd parity code must be selected if the total number of bits (n+1) is even, and the even parity if n+1 is odd
Byte-Interlaced Parity Code
Example: n = 64, data bits a63, a62, …, a0
Eight parity bits:
• First: parity bit of a63, a55, a47, a39, a31, a23, a15, a7 (the most significant bits of the eight bytes)
• Remaining seven parity bits: assigned so that the corresponding groups of bits are interlaced
Scheme is beneficial when shorting of adjacent bits is a common failure mode (example - a bus)
If parity type (odd or even) is alternated between groups - unidirectional errors (all-0's or all-1's) will also be detected
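A sketch of the 64-bit example, assuming even parity in every group and a data word held as a Python integer. XORing the eight bytes together yields eight parity bits, where bit j is the parity of bit j of every byte. The names interlaced_parity and single_parity are illustrative.

```python
def interlaced_parity(word64: int) -> int:
    """Eight even-parity bits; bit j covers a_j, a_(j+8), ..., a_(j+56)."""
    parity = 0
    for byte_index in range(8):
        parity ^= (word64 >> (8 * byte_index)) & 0xFF
    return parity


def single_parity(word64: int) -> int:
    """One even-parity bit over the whole word, for comparison."""
    return bin(word64).count("1") % 2


if __name__ == "__main__":
    word = 0x0123456789ABCDEF
    shorted = word ^ 0b11   # two adjacent bits (a1, a0) flipped together
    # A single parity bit misses the double error; interlaced parity sees it,
    # because a1 and a0 belong to different parity groups.
    print(single_parity(word) == single_parity(shorted))
    print(interlaced_parity(word) != interlaced_parity(shorted))
```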
Overlapping Parity Code
Simplest scheme: data is organized in a 2-dimensional array
• Bits at the end of a row: parity over that row
• Bits at the bottom of a column: parity over that column
Error correcting code?
• A single-bit error anywhere will cause a row and a column to be erroneous
• This identifies a unique erroneous bit
This is an example of overlapping parity: each bit is covered by more than one parity bit
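The row/column scheme can be sketched as follows, assuming even parity and a block stored as a list of rows of 0/1 integers. The function names are hypothetical.

```python
def encode_2d(rows):
    """Append an even-parity bit to each row, then a row of column parities."""
    with_row_parity = [row + [sum(row) % 2] for row in rows]
    column_parity = [sum(col) % 2 for col in zip(*with_row_parity)]
    return with_row_parity + [column_parity]


def correct_single_error(block):
    """Locate and flip a single erroneous bit in place.

    Returns the (row, column) that was corrected, or None if all parities hold.
    """
    bad_rows = [i for i, row in enumerate(block) if sum(row) % 2]
    bad_cols = [j for j, col in enumerate(zip(*block)) if sum(col) % 2]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        i, j = bad_rows[0], bad_cols[0]
        block[i][j] ^= 1
        return i, j
    raise ValueError("more than one bit in error; cannot correct")


if __name__ == "__main__":
    block = encode_2d([[1, 0, 1], [0, 1, 1], [1, 1, 0]])
    block[1][2] ^= 1                      # inject a single-bit error
    print(correct_single_error(block))    # position of the repaired bit
```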
Checksum
Separable code
Checksum is the sum of the original data
All checksum schemes allow error detection but not error location; the entire block of data must be retransmitted if an error is detected
a) Single-precision checksum
• overflow problem, i.e., n-bit words are added modulo 2^n
b) Double-precision checksum
• uses double precision, i.e., computes a 2n-bit checksum from n-bit words using modulo 2^(2n) arithmetic
c) Residue checksum
• like the single-precision checksum, but overflow is now fed back as a carry
d) Honeywell checksum
• composes words of double length by concatenating 2 consecutive words (arithmetic done modulo 2^(2n))
• computes the checksum on these double words
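The four checksum variants can be compared with a small sketch. It assumes the data block is a list of n-bit unsigned integers; padding an odd-length block with a zero word in the Honeywell variant is an assumption made here, not something the slides specify. All function names are illustrative.

```python
def single_precision(words, n):
    """Sum of n-bit words modulo 2^n (overflow is discarded)."""
    return sum(words) % (1 << n)


def double_precision(words, n):
    """Sum of n-bit words kept in 2n bits (modulo 2^(2n))."""
    return sum(words) % (1 << (2 * n))


def residue(words, n):
    """Like single precision, but each overflow is fed back as a carry."""
    mask = (1 << n) - 1
    total = 0
    for word in words:
        total += word
        total = (total & mask) + (total >> n)   # end-around carry
    return total


def honeywell(words, n):
    """Concatenate consecutive word pairs into 2n-bit words, sum mod 2^(2n)."""
    words = list(words)
    if len(words) % 2:
        words.append(0)   # assumption: pad an odd-length block with zero
    doubles = [(words[i] << n) | words[i + 1] for i in range(0, len(words), 2)]
    return sum(doubles) % (1 << (2 * n))


if __name__ == "__main__":
    block = [0b0111, 0b1001, 0b1100, 0b0011]   # four 4-bit words
    for scheme in (single_precision, double_precision, residue, honeywell):
        print(scheme.__name__, scheme(block, 4))
```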
Comparing the Checksum Types
Comparison - Example
In Single-precision checksum - transmitted checksum differs from computed checksum
In Honeywell checksum computed checksum differs from received checksum and error is detected
Cyclic Codes
Cyclic codes are often non-separable, although separable cyclic codes exist
Encoding consists of multiplying the data word by a constant number
• The coded word is the product
Decoding is dividing by the same constant; if the remainder is non-zero, an error has occurred
Cyclic codes are widely used in data storage and communication
Cyclic Redundancy Checks (CRC)
CRC is based on a mathematical calculation performed on the message.
We will use the following terms:
• M: message to be sent (k bits)
• F: Frame Check Sequence (FCS) or CRC to be appended to the message (n bits)
• T: transmitted message, which includes both M and F (k+n bits)
• G: (n+1)-bit pattern (called the generator polynomial) used to calculate F and check T
Cyclic Redundancy Check (CRC)
Key idea
• given a k-bit frame (message)
• the transmitter generates an n-bit sequence called the frame check sequence (FCS)
• so that the resulting frame of size k+n is exactly divisible by some predetermined number
Multiply M by 2^n to shift, and add F to the padded 0s
• T = 2^n M + F
Dividing 2^n M by G gives a quotient and a remainder (the remainder is 1 bit shorter than the divisor)
• 2^n M / G = Q + R/G
Then, using R as our FCS (F = R), we get
• T = 2^n M + R
On the receiving end, division by G leads to
• T/G = (2^n M + R)/G = Q + R/G + R/G = Q (in mod-2 arithmetic, R + R = 0)
If the remainder is non-zero, it's an error
Cyclic Redundancy Check (CRC)
Example: assume G(X) has at least 3 terms
• G(X) has three 1-bits
  » detects all single-bit errors
  » detects all double-bit errors
  » detects odd numbers of errors if G(X) contains the factor (X + 1)
  » detects any burst error shorter than the length of the FCS
  » detects most larger burst errors
Cyclic Redundancy Check (CRC)
A polynomial view:
• a polynomial in the variable X with binary coefficients, where the coefficients correspond to the bits in the number
• M = 110011 gives M(X) = X^5 + X^4 + X + 1, and G = 11001 gives G(X) = X^4 + X^3 + 1
• Math is still mod 2
  » An error pattern E(X) goes undetected iff it is divisible by G(X)
CRC Example
M = 10110100011, G = 1101; XOR is used instead of subtraction
M with three 0s appended is divided by G:
10110100011000 | 1101
1101
 1100
 1101
    1100
    1101
       1011
       1101
        1100
        1101
           100   => CRC = 100
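The long division above can be reproduced with a short sketch. It assumes bit strings with the most significant bit first; mod2_remainder and crc are illustrative names, not a library API.

```python
def mod2_remainder(bits: str, generator: str) -> str:
    """Remainder of dividing `bits` by `generator` using XOR (mod-2) division."""
    n = len(generator) - 1
    work = [int(b) for b in bits]
    gen = [int(b) for b in generator]
    for i in range(len(work) - n):
        if work[i]:                      # leading 1: "subtract" G by XOR
            for j, g in enumerate(gen):
                work[i + j] ^= g
    return "".join(str(b) for b in work[-n:])


def crc(message: str, generator: str) -> str:
    """FCS for the message: remainder of (message shifted by n bits) / G."""
    n = len(generator) - 1
    return mod2_remainder(message + "0" * n, generator)


if __name__ == "__main__":
    m, g = "10110100011", "1101"
    fcs = crc(m, g)
    print(fcs)                            # the slide's example gives 100
    print(mod2_remainder(m + fcs, g))     # receiver side: all zeros if no error
```

On the receiving side, any non-zero remainder of the received frame signals an error.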
Cyclic Redundancy Check (CRC)
Pre-defined polynomial examples:
• CRC-12: X^12 + X^11 + X^3 + X^2 + X + 1
• CRC-16: X^16 + X^15 + X^2 + 1
• CRC-CCITT: X^16 + X^12 + X^5 + 1
Why is CRC popular?
• Easy to implement! Just need shifters and XORs
Hardware implementation:
• G(X) = 1 + a_1 X + a_2 X^2 + … + a_(n-1) X^(n-1) + a_n X^n
Hamming Code (7,4)
Class of (n,k) Hamming codes, e.g., (7,4) [r = n-k = 3]
Let i1, i2, i3, i4 be the information bits
Let p1, p2, p4 be the check bits
• p1 = i1 XOR i2 XOR i4
• p2 = i1 XOR i3 XOR i4
• p4 = i2 XOR i3 XOR i4
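A sketch of encoding and single-error correction with these check equations. The bit ordering p1 p2 i1 p4 i2 i3 i4 (positions 1 to 7) is the conventional layout and is an assumption here, since the slide does not state the ordering. With that layout the syndrome, read as a binary number, gives the position of the erroneous bit. Function names are hypothetical.

```python
def hamming74_encode(i1, i2, i3, i4):
    """Return the 7-bit code word [p1, p2, i1, p4, i2, i3, i4]."""
    p1 = i1 ^ i2 ^ i4
    p2 = i1 ^ i3 ^ i4
    p4 = i2 ^ i3 ^ i4
    return [p1, p2, i1, p4, i2, i3, i4]


def hamming74_correct(word):
    """Correct at most one flipped bit.

    Returns (corrected word, error position 1..7, or 0 if no error was found).
    """
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]   # recheck p1 over positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]   # recheck p2 over positions 2, 3, 6, 7
    s4 = w[3] ^ w[4] ^ w[5] ^ w[6]   # recheck p4 over positions 4, 5, 6, 7
    position = s1 + 2 * s2 + 4 * s4
    if position:
        w[position - 1] ^= 1
    return w, position


if __name__ == "__main__":
    code_word = hamming74_encode(1, 0, 1, 1)
    received = list(code_word)
    received[4] ^= 1                  # flip the bit at position 5 (i2)
    print(hamming74_correct(received))
```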
Unordered code
To detect all unidirectional errors
• m-of-n code
• Berger code
m-of-n codes
All code words are n bits in length and contain exactly m 1’s
Simple implementation
Can detect all single errors
Can detect all unidirectional multiple errors
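A small sketch of an m-of-n code, using 2-of-5 as an illustrative choice. Checking a word only requires counting its 1s, which is why unidirectional errors (which can only lower or only raise the count) are always caught. The function names are hypothetical.

```python
from itertools import combinations


def m_of_n_code(m: int, n: int):
    """All n-bit words that contain exactly m ones."""
    words = []
    for ones in combinations(range(n), m):
        words.append("".join("1" if i in ones else "0" for i in range(n)))
    return words


def is_valid(word: str, m: int) -> bool:
    """A received word is accepted only if it has exactly m ones."""
    return word.count("1") == m


if __name__ == "__main__":
    print(len(m_of_n_code(2, 5)))   # number of 2-of-5 code words
    print(is_valid("10000", 2))     # 11000 with one 1 -> 0 error
    print(is_valid("10100", 2))     # a 1 -> 0 and a 0 -> 1 together slip through
```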
Berger Code
Separable code
• counts the number of 1s in the word
• expresses it in binary
• complements it
• appends this quantity to the data
Example: encoding 11101
• Four 1s
• 100 in binary
• 011 after complementing
• the encoded word is 11101011
Can detect all single errors
Can detect all unidirectional bit errors: one or more 1s turn to 0s and no 0s turn to 1s (or vice versa)
Overhead = r/(k+r)
• k data bits contain at most k 1s, so r = ⌈log2(k+1)⌉ redundant bits
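The encoding steps above can be sketched as follows, assuming the data word is a non-empty bit string. For k >= 1 the integer method k.bit_length() equals ⌈log2(k+1)⌉, which is used for r. The function names are illustrative.

```python
def berger_encode(data: str) -> str:
    """Append the complemented binary count of 1s to the data word."""
    k = len(data)
    r = k.bit_length()                      # equals ceil(log2(k + 1)) for k >= 1
    ones = data.count("1")
    check = ones ^ ((1 << r) - 1)           # bitwise complement within r bits
    return data + format(check, f"0{r}b")


def berger_check(word: str, k: int) -> bool:
    """True if the check field matches the data field of a received word."""
    return berger_encode(word[:k]) == word


if __name__ == "__main__":
    encoded = berger_encode("11101")
    print(encoded)
    print(berger_check(encoded, 5))
    print(berger_check("10101011", 5))      # a 1 -> 0 error in the data field
```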
Other Coding Schemes
Many error detecting/correcting codes exist
• E.g., arithmetic codes, Reed-Solomon codes, residue codes, bi-residue codes, etc.
Many of them require more mathematics than belongs in this course
Reasons for other types of codes
• Burst errors
• Byte errors
• Cost/performance
• Multiple-bit errors
• Ease of hardware implementation
Error Recovery
Probably the most important phase of any fault-tolerance technique
Two approaches:
• Forward
• Backward
Forward Error Recovery
Forward error recovery continues from an erroneous state by making selective corrections to the system state
This includes making safe the controlled environment which may be damaged because of the failure
It is system specific and depends on accurate predictions of the location and cause of errors
Examples: redundant pointers in data structures and the use of self-correcting codes such as Hamming Codes
Backward Error Recovery (BER)
If an error is detected, recover backwards and re-execute
• Recover to a previous state of the system that we know is error-free
• Assumes that the error will be gone by the time of re-execution
Some terminology:
• Recovery point: the point to which we recover in case of error
• Checkpointing: periodically saving the state of the system
• Logging: saving changes made to the system state
Many commercial machines use BER
• Sequoia, Synapse N+1, Tandem NonStop
BER also includes all-software schemes
• Nightly backups of file systems
May sacrifice performance to achieve availability
• Where might we lose performance?
• May not be suitable for real-time systems
Disadvantage
• It cannot undo errors in the environment!
The Domino Effect
With concurrent processes that interact with each other, BER is more complex
Consider:
[Figure: processes P1 and P2 along an execution-time axis, with recovery points R11, R12, R13, R21, R22 and inter-process communications IPC1, IPC2, IPC3, IPC4]
If the error is detected in P1, roll back to R13.
If the error is detected in P2?
6 BER Issues
1- What state needs to be saved?
2- How do we save this state?
3- Where do we save it?
4- How often do we save it?
5- How do we recover the system to this state?
6- How do we resume execution after recovery?
1- What State needs to be saved
Need to save all state that would be necessary if this were to become the recovery point
In general, we only need to save the user-visible state
For example, microprocessors:
• Must save architectural state
• Don't have to worry about micro-architectural state
2- How to Save State
Two “hints” of BER:
• Checkpointing: periodically stop the system and save state
• Logging: log all changes to state
Checkpointing
• Only suffers overhead at periodic checkpoints
• Can only recover at coarse granularity
• Size of checkpoint is often fixed
Logging
• Finer granularity of rollback
• Suffers overhead for logging many common operations
• Amount of state logged is variable
3- Where to Save State
Have to save state where it is “reliable”
• A fault in the recovery point state could make recovery impossible
In processor (can't survive loss of processor chip)
• Processor saves registers to shadow registers
In cache (same as processor, if on-chip cache)
• Processor copies registers into cache
In memory (memory can be made very reliable)
• Processor copies registers into memory
• Write-through cache copies data into memory
On disk (maybe the safest, but slow)
• E.g., databases log updates to disks
On tape (too slow except for rare backups)
4- When to Save State
Checkpointing
• Can choose checkpoint interval
Logging
• Continuously saving state (every time it changes)
For checkpointing, a larger checkpoint interval means
• Less overhead due to checkpointing (since less frequent)
• Coarser checkpoint granularity (can't recover to arbitrary point)
5- How to Recover State
Checkpointing: copy the pre-fault recovery point checkpoint into architectural state
Logging: unroll the log to undo changes since the recovery point
Tradeoff between these two depends on system
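The two recovery styles can be contrasted with a toy sketch. It models system state as a Python dictionary, which is purely an assumption for illustration; the class and method names are hypothetical and do not describe any real machine.

```python
import copy


class CheckpointedState:
    """Checkpointing: save a full copy of the state at each recovery point."""

    def __init__(self, state):
        self.state = dict(state)
        self._checkpoint = copy.deepcopy(self.state)

    def checkpoint(self):
        self._checkpoint = copy.deepcopy(self.state)

    def rollback(self):
        # Copy the saved checkpoint back into the live state.
        self.state = copy.deepcopy(self._checkpoint)


class LoggedState:
    """Logging: record the old value of every change since the recovery point."""

    _MISSING = object()

    def __init__(self, state):
        self.state = dict(state)
        self._undo_log = []

    def write(self, key, value):
        self._undo_log.append((key, self.state.get(key, self._MISSING)))
        self.state[key] = value

    def set_recovery_point(self):
        self._undo_log.clear()

    def rollback(self):
        # Unroll the log, newest entry first, to undo each change.
        while self._undo_log:
            key, old = self._undo_log.pop()
            if old is self._MISSING:
                del self.state[key]
            else:
                self.state[key] = old


if __name__ == "__main__":
    logged = LoggedState({"r1": 1})
    logged.write("r1", 5)
    logged.write("r2", 7)
    logged.rollback()
    print(logged.state)
```

The checkpoint copy has a fixed cost per recovery point, while the log grows with the number of writes, which mirrors the trade-off listed in the slides.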
6- How to Resume Execution
Simply resuming execution after recovery may not be possible
• E.g., recovery due to a hard fault in an interconnection switch
May need to reconfigure before resuming, to ensure forward progress
• E.g., reconfiguring the routing in the interconnect to avoid a dead switch
Implementing EDC/ECC in Hardware
Where does EDC/ECC get used?
• Disk, CD-ROM
• Memory (DRAM, SRAM)
• Buses
Tradeoff between EDC and ECC
ECC: forward error recovery
• Often on the critical path, so it can slow down even a fault-free system
• Used in one-way transmission
EDC: backward error recovery
• Detecting an error requires recovery (can be slow)