Datapath Synthesis for Overclocking: Online Arithmetic for ...cas.ee.ic.ac.uk/people/gac1/pubs/KanDAC14.pdf · N +x N y N y N Cin 1 Cin 2 z 2 + z 2-3 N-N+1 N+1-Binary Bits: Digits:

Datapath Synthesis for Overclocking: Online Arithmetic forLatency-Accuracy Trade-offs

Kan Shi, David Boland, Edward Stott, Samuel Bayliss and George A. ConstantinidesDepartment of Electrical and Electronic Engineering

Imperial College LondonLondon, UK

{k.shi11, david.boland03, ed.stott, s.bayliss08, g.constantinides}@imperial.ac.uk

ABSTRACTDigital circuits are currently designed to ensure timing

closure. Releasing this constraint by allowing timing vio-lations could lead to significant performance improvements,but conventional forms of computer arithmetic do not failgracefully when pushed beyond deterministic operation. Inthis paper we take a fresh look at Online Arithmetic, orig-inally proposed for digit serial operation, and synthesizeunrolled digit parallel online operators to allow for grace-ful degradation. We quantify the impact of timing viola-tion on key arithmetic primitives, and show that substan-tial performance benefits can be obtained in comparison tobinary arithmetic. Since timing errors are caused by longcarry chains, these result in errors in least significant dig-its with online arithmetic, causing less impact than conven-tional implementations. Using analytical models and empir-ical FPGA results from an image processing application, wedemonstrate an error reduction over 89% and an improve-ment in SNR of over 20dB for the same clock rate.

Categories and Subject DescriptorsB.2.4 [Arithmetic and Logic Structures]: High-Speed

Arithmetic—Algorithms,Cost/performance

General TermsDesign, Performance, Reliability.

KeywordsOnline Arithmetic, Overclocking, Imprecise Design.

1. INTRODUCTIONCircuit performance has increased tremendously over the

past decades with the continuous scaling of CMOS tech-nology such that they occupy less area and operate faster.Typically static timing analysis incorporates safety marginsfor timing closure. However, continuing with this approachcould become very expensive in the short term future. As

Permission to make digital or hard copies of all or part of this work forpersonal or classroom use is granted without fee provided that copies are notmade or distributed for profit or commercial advantage and that copies bearthis notice and the full citation on the first page. Copyrights for componentsof this work owned by others than ACM must be honored. Abstracting withcredit is permitted. To copy otherwise, or republish, to post on servers or toredistribute to lists, requires prior specific permission and/or a fee. Requestpermissions from [email protected]’14, June 01 - 05 2014, San Francisco, CA, USACopyright 2014 ACM 978-1-4503-2730-5/14/06 ...$15.00.http://dx.doi.org/10.1145/2593069.2593118

circuit dimensions scale down, it is increasingly difficult tomaintain low relative manufacturing variation. Because tim-ing margins in static analysis tools increase with processvariability, designing for worst-case delay in circuits becomesincreasingly costly as relative process variation rises.While datapath can be pipelined to boost clock frequency,

this does not reduce overall computational latency. In em-bedded applications, which often have strict latency require-ments, or in any datapath containing feedback where C-slowretiming is inappropriate, pipelining does not help meet per-formance targets. To overcome these problems, a large vol-ume of studies has shown that relaxing absolute datapathaccuracy requirements can provide the freedom to createdesigns with better performance or energy efficiency [1, 7,9]. Specifically, in our previous work we have demonstratedthat allowing timing violations to happen can be beneficial,because the worst case scenario only happens rarely [10].However, we notice that most research in this area fo-

cuses on standard binary arithmetic [10, 7, 9], which poten-tially suffers from large magnitude of timing errors. Thisis because, in standard arithmetic operators such as ripple-carry adders, the most significant digit (MSD) is updatedlast due to carry propagation, so timing violation initiallyaffects the MSD of the result. In this paper, for the first timewe attempt to solve this problem by adopting an alternativeform of MSD-first arithmetic known as online arithmetic.We evaluate the probabilistic behavior of basic arithmeticprimitives with different types of arithmetic when operatingbeyond the deterministic clocking region. We suggest thatin comparison to conventional arithmetic, operators employ-ing this arithmetic are less sensitive to overclocking, becausetiming errors only affect the least significant digits (LSDs).To support this hypothesis, we propose probabilistic modelsof error for an online multiplier. We back up the models withexperimental results from an image processing application.We demonstrate that our novel design methodology can leadto substantial performance benefits compared to the designmethod using conventional arithmetic. A summary of themain contributions of this paper are as follows:

∙ to our knowledge, the first“overclocking friendly”com-puter arithmetic,

∙ probabilistic models of overclocking errors for a digit-parallel online multiplier,

∙ analytical and empirical results which demonstrate thatthe impact of timing violations is ameliorated with on-line arithmetic and that large performance benefits canbe achieved.

2. BACKGROUND2.1 Online ArithmeticIn conventional arithmetic, results are generated either

from the least significant digit, e.g. addition and multiplica-tion, or from the MSD, e.g. division and square root. Thisinconsistency in computing directions can result in large la-tency when propagating data among different operations.To tackle this problem, online arithmetic has been proposedand studied over the past few decades [4, 11]. Both theinputs and outputs are processed in a MSD-first manner us-ing online arithmetic, enabling parallelism among multipleoperations. Thus the overall latency can be greatly reduced.Originally online arithmetic was designed for digit-serial

operation. To generate the first output digit, 𝛿 digits ofinputs are required, as illustrated in Figure 1. 𝛿 is knownas the online delay, which is a constant, independent of theprecision in a given operation. For ease of discussion, theinput to our circuit is normalized to a fixed point number inthe range (−1, 1). Based on this premise, the online repre-sentations of 𝑁 -digit operands and result at iteration 𝑗 aregiven by (1) for 𝑗 ∈ [−𝛿,𝑁 − 1] and 𝑟 denotes the radix [5].

Input X[j] x1

Input Y[j]

x2 x3 x4 x5 xN

y1 y2 y3 y4 y5 yN

Output Z[j] z1 z2 z3 zN

MSD LSD

0 0 00 0 0

δ MSD LSD

Figure 1: Dataflow in digit-serial Online Arithmetic.

𝑋[𝑗] =

𝑗+𝛿∑𝑖=1

𝑥𝑖𝑟−𝑖, 𝑌[𝑗] =

𝑗+𝛿∑𝑖=1

𝑦𝑖𝑟−𝑖, 𝑍[𝑗] =

𝑗∑𝑖=1

𝑧𝑖𝑟−𝑖 (1)

MSD-first operation in online arithmetic is possible be-cause they adopt a redundant number system [2], in whicheach digit is represented with a redundant digit set {−𝑎, ⋅ ⋅ ⋅ ,− 1, 0, 1, ⋅ ⋅ ⋅ , 𝑎}, where 𝑎 ∈ [𝑟/2, 𝑟 − 1]. Due to the redun-dancy, the MSDs of the result can be calculated only basedon partial information of the inputs and then the value ofthe number can be revised by the following digits.

2.2 Radix-2 Digit-parallel Online AdderAdders serve as a critical building block for arithmetic op-

erations. To perform digit-parallel online addition, a redun-dant adder can be directly utilized. The adder structure dia-gram is shown in Figure 2. A major advantage of the redun-dant number system over the standard ripple-carry basedarithmetic is that the propagation of carry is eliminated,resulting in a precision-independent computation time foraddition. As seen in Figure 2, the computation delay ofthis adder is only 2 full adder (FA) delays for any operandword-length, with the cost of one extra FA for each digit ofoperands. This makes the online adder suitable for buildingup more complex arithmetic operators such as multipliers toaccelerate the sum of partial products [8].

2.3 Radix-2 Digit-parallel Online MultiplierMultiplication is another key arithmetic operator. Typ-

ically the online multiplication is performed in a recursivedigit-serial manner, as illustrated in Algorithm 1 [11] whereboth inputs and outputs are 𝑁 -digit number as given in (1).

FA

x1+ x1

- y1+

FA

FA

FA

FA

FA

y1- x2

+ x2- y2

+ y2- xN

+ xN- yN

+ yN-

Cin1 Cin2

z2+ z2

- z3+ zN

- zN+1+ zN+1

-

Binary Bits:

Digits: x1 y1 x2 y2 xN yN

z2 zN+1

MSD LSD

Critical Path

z1+ z1

-

z1

Figure 2: An 𝑁-digit binary digit-parallel onlineadder with 𝑁 + 1 -digit outputs.

Algorithm 1 Online Multiplication

Initialization: 𝑋[−𝛿] = 𝑌[−𝛿] = 𝑃[−𝛿] = 0Recurrence: 𝑓𝑜𝑟 𝑗 = −𝛿, − 𝛿 + 1, ⋅ ⋅ ⋅ , 𝑁 − 1 𝑑𝑜

𝐻[𝑗] = 𝑟−𝛿(𝑥𝑗+𝛿+1 ⋅ 𝑌[𝑗+1] + 𝑦𝑗+𝛿+1 ⋅𝑋[𝑗]

)𝑊[𝑗] = 𝑃[𝑗] +𝐻[𝑗]

𝑧𝑗 = 𝑠𝑒𝑙(𝑊[𝑗])𝑃[𝑗+1] = 𝑟

(𝑊[𝑗] − 𝑧𝑗

) (2)

For a given iteration 𝑗, the product digit 𝑧𝑗 is generatedthrough a selection function 𝑠𝑒𝑙(). For the radix 𝑟 and achosen digit set, there exits an appropriate selection methodand a value of 𝛿 which ensure convergence [11]. As radix-2 is used most commonly in computer arithmetic, we keep𝑟 = 2 throughout this paper with the corresponding redun-dant digit set {1, 0, 1}. In this case we have 𝛿 = 3, and 𝑠𝑒𝑙()is given by (3) [6]. It can be seen that the election is madeonly based on 1 integer bit and 1 fractional bit of 𝑊[𝑗].

𝑠𝑒𝑙(𝑊[𝑗]) =

⎧⎨⎩1 if 𝑊[𝑗] ⩾ 1

2

0 if − 12⩽ 𝑊[𝑗] <

12

1 if 𝑊[𝑗] < − 12

(3)

Algorithm 1 can be synthesized into a unrolled digit par-allel structure as shown in Figure 3(a). In an 𝑁 -digit on-line multiplier (OM), there are a total of 𝑁 + 𝛿 stages, ofwhich the online inputs are generated from the appendinglogic according to (1). In each stage, an online adder isused as shown in Figure 3(b). The SDVM module per-forms the Signed-Digit Vector Multiplication. When 𝑟 = 2and the digit set {1, 0, 1} is used, the implementation ofSDVM is straight forward, because for instance the outputof (𝑦𝑗+𝛿+1 ⋅𝑋[𝑗]) is 0, 𝑋[𝑗] or −𝑋[𝑗] when 𝑦 is equal to 0, 1and −1 respectively.Instead of duplicating the digit-serial implementation 𝑁+

𝛿 times, each stage in the digit-parallel architecture can beoptimized for area reduction. For instance, the SDVM mod-ules and the appending logic are not required in the last 𝛿stages because the inputs are 0. This leads to a smaller on-line adder in these stages. Similarly for the first 𝛿 stages theselection logic can be removed, as the first digit of the resultis generated at stage 0 (𝑆0).

3. PROBABILISTIC MODEL OFOVERCLOCKING ERROR

As the delay of online adder is small, the occurrence oftiming violation requires very fast frequency. Hence we only

Stage -δ

Append Logic

Stage -δ+1

x1

y1

y2

X[-δ] Y[-δ]

Y[-δ+1]

Y[-δ+2]X[-δ+1]

Append Logic

Append Logic

P[-δ]

P[-δ+1]

P[-δ+2]

z-δ

z1-δ

Main Propagation Chain

SDVM SDVM

Online Adder

P & Shift

yj+δ+1

zj

xj+δ+1

P[j+1]

W[j]

P[j]Y[j+1] X[j]

Stage j

Selection Logic

(a) (b)

j+δ+2 j+δ+1

j+2δ+22

j+2δ

j+2δ+1

j+2δ j+δ+1j+δ+2

(Sj)

μ

Figure 3: (a) Synthesis of Algorithm 1 into a digit-parallel online multiplier (b) Structure of one stagewith the maximum digit widths of signals labeled.

model the overclocking error in the radix-2 OM. From Fig-ure 3 we observe two types of delay chains. One is caused bygeneration and propagation of 𝑃[𝑗] among different stages.The other is the generation of online inputs 𝑋[𝑗] and 𝑌[𝑗].Since the appending logic is basically wires and simple com-binational logic [3], the overall latency will eventually bedetermined by the delay of the 𝑃[𝑗] path, especially with in-creasing operand word-lengths. As such, we initially modelthe delay of each stage within an OM to be a constant value𝜇, as shown in Figure 3(b). We also assume that the gener-ation of online inputs costs no delay.Let 𝜇𝑂𝑀 denote the worst-case delay of an OM. It fol-

lows that if the clock period 𝑇𝑆 is greater than 𝜇𝑂𝑀 , cor-rect results will be sampled. If, however, faster-than-ratedsampling frequencies are applied such that 𝑇𝑆 < 𝜇𝑂𝑀 , tim-ing violations might happen and intermediate results will besampled, potentially generating errors. For a given 𝑇𝑆 , themaximum length of error-free propagation is described by(4) where 𝑓𝑆 denotes the sampling frequency. We alwaysensure that the first digit of the product will be generatedcorrectly, i.e. 𝑏 > 𝛿.

𝑏 :=

⌈𝑇𝑆𝜇𝑂𝑀

⌉=

⌈1

𝜇𝑂𝑀 ⋅ 𝑓𝑆

⌉(4)

We now determine when this timing constraint is not metand the size of error in this case. We assume in the modelsthat every digit of input signal to our circuit is uniformly andindependently generated with the digit set {1, 0, 1}. How-ever this assumption will be relaxed in Section 4 where realimage data are used.

3.1 Probability of Timing ViolationsFor an 𝑁 -digit OM, let a propagation chain be generated

at stage 𝑆𝜏 with the length of 𝑑(𝜏) digits, then the stagenumber 𝜏 is bounded by (5). The presence of timing vio-lation requires 𝑑(𝜏) > 𝑏. Also the chain cannot propagateover 𝑆𝑁−1. Thus the bound of 𝑑(𝜏) is given by (6).

−𝛿 ⩽ 𝜏 ⩽ 𝑁 − 1− 𝑏 = 𝜏𝑚𝑎𝑥 (5)

𝑏 < 𝑑(𝜏) ⩽ 𝑁 − 1− 𝜏 (6)

However, the actual length of a “carry” chain is dependentupon input patterns. We then examine the relationship be-

tween 𝑑(𝜏) and the inputs that corresponds to the genera-tion, propagation and annihilation of this carry chain. Letthe specific input pattern of stage 𝑆𝜏 be represented by 𝐶(𝜏),which can be classified into four types, as listed in (7). Un-der the assumption that all digits are mutually independentand uniformly distributed, the probability of 𝐶1 ∼ 𝐶4 equalto 1/9, 4/9, 2/9 and 2/9 respectively.

𝐶(𝜏) =

⎧⎨⎩Case 1 (𝐶1) : 𝑥𝜏+𝛿+1 = 0, 𝑦𝜏+𝛿+1 = 0Case 2 (𝐶2) : 𝑥𝜏+𝛿+1 ∕= 0, 𝑦𝜏+𝛿+1 ∕= 0Case 3 (𝐶3) : 𝑥𝜏+𝛿+1 ∕= 0, 𝑦𝜏+𝛿+1 = 0Case 4 (𝐶4) : 𝑥𝜏+𝛿+1 = 0, 𝑦𝜏+𝛿+1 ∕= 0

(7)

The classification is based on whether a digit is zero, asall internal signals are reset to zero initially. Under thispremise, (2) can be modified to give (8) for 𝜏 > −𝛿. If𝐶(𝜏) = 𝐶1, no chain will be generated at 𝑆𝜏 as 𝑃[𝜏+1] = 0.

𝑃[𝜏+1] = 𝑟 ⋅𝑊[𝜏 ]

= 𝑟−𝛿+1(𝑥𝜏+𝛿+1𝑌[𝜏+1] + 𝑦𝜏+𝛿+1𝑋[𝜏 ])(8)

In other cases, 𝑃[𝜏+1] ∕=0 and a new chain will be generatedat 𝑆𝜏 . We denote the number of the leading non-zero digitsof 𝑃[𝜏+1] to be 𝐷. When 𝑃[𝜏+1] propagates to the nextstage, from (2) we notice that 𝐷 = 𝐷−1, because of the1-digit shifting when computing 𝑃[𝜏+1] = 𝑟(𝑊[𝜏 ]− 𝑧𝜏 ). Thisprocess is independent of inputs. Finally, this chain will beannihilated when 𝐷= 1. In summary, chain length 𝑑(𝜏) isrelated to the value of 𝐷, which we consider separately for𝐶(𝜏) = 𝐶2, 𝐶3 and 𝐶4.If 𝐶(𝜏) = 𝐶2, 𝐷 is the maximum word-length of 𝑃[𝜏+1] as

shown in Figure 3(b), i.e. 𝐷 = 𝜏 + 2𝛿 + 1. Combining 𝐷and (6) gives the maximum chain length in (9).

𝑑(𝜏)𝑚𝑎𝑥 =𝑀𝑖𝑛(𝐷,𝑁 − 1− 𝜏)=𝑀𝑖𝑛(𝜏 + 2𝛿 + 1, 𝑁 − 1− 𝜏) (9)

For 𝐶3 and 𝐶4, the carry generation, propagation andannihilation behavior of both cases are identical. Hence weonly discuss 𝐶(𝜏) = 𝐶3, where (8) is modified to give (10).We have 𝑌[𝜏+1] = 𝑌[𝜏 ] as 𝑦𝜏+𝛿+1 = 0. Thus 𝐷 only dependson the number of the leading non-zero digits of 𝑌[𝜏 ].

𝑃[𝜏+1] = 𝑟−𝛿+1(𝑥𝜏+𝛿+1𝑌[𝜏+1]) = 𝑟

−𝛿+1(𝑥𝜏+𝛿+1𝑌[𝜏 ]) (10)

For the first stage, we have the stage number 𝜏 = −𝛿.Substituting this into (10) yields 𝑃[−𝛿+1] = 𝑟−𝛿+1(𝑥1𝑦1).Therefore 𝑑(−𝛿) = 𝑑(−𝛿)𝑚𝑎𝑥 as given in (9), and it is ob-tained only when 𝐶(−𝛿) = 𝐶2, otherwise 𝑑(−𝛿) = 0.If the carry chain is generated from other stages where

𝜏 > −𝛿, 𝐷 and 𝑑(𝜏) can be determined from 2 possiblesituations for 𝜏 ′ = 𝜏 − 1 as stated below:∙ If 𝑦𝜏 ′+𝛿+1 ∕= 0, 𝐷 equals to the maximum word-lengthof 𝑌[𝜏 ′+1]. Thus we have 𝑑(𝜏) = 𝑑(𝜏

′)𝑚𝑎𝑥 − 1.∙ If 𝑦𝜏 ′+𝛿+1 = 0, 𝑌[𝜏 ′] = 𝑌[𝜏 ′−1]. This is a recursiveprocess and these two judgements will be performedagain for 𝜏 ′′ = 𝜏 ′ − 1. It terminates when the firstjudgement is satisfied or the first stage is reached.

The entire process can be summarized into Algorithm 2,which examines all possible stages that timing violationsmay occur, and all possible input patterns for each stage.This algorithm returns 𝑃𝑟(𝑇𝑆), which denotes the proba-bility that timing violations happen in an 𝑁 -digit onlinemultiplier as a function of 𝑇𝑆 . In this algorithm, 𝑏 and 𝜏𝑚𝑎𝑥

are obtained from (4) and (5). The probability of the input

pattern for stage 𝑖 is denoted by 𝑃𝑟(𝐶(𝑖)), which can bedetermined through (7).

Algorithm 2 Probability of Timing Violations

1: 𝑃𝑟(𝑇𝑆)← 02: for all stages 𝜏 ∈ {−𝛿,−𝛿 + 1, ⋅ ⋅ ⋅ , 𝜏𝑚𝑎𝑥} and all inputcases 𝐶(𝜏) ∈ {𝐶1, ⋅ ⋅ ⋅ , 𝐶4} do

3: Determine 𝑑(𝜏) based on 𝐶(𝜏)4: if 𝑑(𝜏) > 𝑏 then5: 𝑃𝑟(𝑇𝑆)← 𝑃𝑟(𝑇𝑆) +∏𝜏

𝑖=−𝛿 𝑃𝑟(𝐶(𝑖))6: end if7: end for8: return 𝑃𝑟(𝑇𝑆)

3.2 Magnitude of Overclocking ErrorIn the presence of timing violation, multiple chains might

not be correctly propagated, resulting in overclocking errorsgenerated from LSDs. We now locate the first output digit𝑧𝜆 that contains error. For a given 𝑇𝑆 and a minimum valueof 𝜏 such that 𝑑(𝜏)𝑚𝑎𝑥 > 𝑏, we have 𝜆 = 𝜏 +𝑑(𝜏)𝑚𝑎𝑥+1 forchain annihilation. As such, the magnitude of overclockingerror can be expressed by (11), where 𝑧𝑖 and 𝑧𝑖

′ denote thecorrect value and the actual value of the output digit of 𝑆𝑖,separately, and 𝜀𝑖 ∈ {±2−𝑖,±2−𝑖+1} since 𝑧 is representedusing digit set {1, 0, 1}.

∣𝜀∣ =∣∣∣∣∣

𝑁∑𝑖=𝜆

2−𝑖(𝑧𝑖 − 𝑧𝑖′)∣∣∣∣∣ =

∣∣∣∣∣𝑁∑

𝑖=𝜆

𝜀𝑖

∣∣∣∣∣ (11)

However, the changing of the product digits may not resultin a different number as it is represented in a redundantform. For instance, the two’s complement number 0.111 canbe represented in the online form as 0.101, 0.011 and 0.111.This will lead to a smaller error magnitude in practice. Incontrast, errors will always be generated with conventionalarithmetic if the output digits are changed, because eachnumber has a unique representation.

3.3 Expectation of Overclocking ErrorFor a given 𝑇𝑆 , the expectation of overclocking error can

be obtained by combining 𝑃𝑟(𝑇𝑆) from Algorithm 2 and∣𝜀∣ from (11), as presented in (12). We first verify the pro-posed model against the results from Monte-Carlo simula-tions based on the aforementioned assumptions on timingand inputs, as shown in Figure 4(a) and Figure 4(b). 𝑇𝑆 isnormalized with respect to the value from structural timinganalysis, i.e. (𝑁+𝛿)𝜇. It can be seen that the modeled valuematch well with the simulation results. We also verify themodels against the FPGA results without timing assump-tions, as illustrated in Figure 4(c) and Figure 4(d). Wenotice that the general behavior of the model is similar tothe FPGA implementation, however it does not capture thesmall increments in errors because we only model the coarsedelay which is in units of 𝜇. Whereas the other sources ofdelay, such as the generation of the online inputs and resultdigits, are not modeled.

𝐸𝑜𝑣𝑐 = 𝑃𝑟(𝑇𝑆) ⋅ ∣𝜀∣ (12)

From Algorithm 2, instead of accumulating 𝑃𝑟(𝑇𝑆), theprobability of each individual chain delay 𝑃𝑟(𝑑) are shown inFigure 5, together with the corresponding error magnitude∣𝜀(𝑑)∣ and the product of both 𝐸(𝑑), where parameter 𝑑

0.45 0.5 0.55 0.6 0.6510

−3

10−2

10−1

Normalized Ts

Mea

n R

elat

ive

Err

or (

%)

Monte−CarloModel

(a) 8-digit

0.3 0.35 0.4 0.45 0.5 0.55 0.610

−5

10−4

10−3

10−2

10−1

Normalized Ts

Mea

n R

elat

ive

Err

or (

%)

Monte−CarloModel

(b) 12-digit

350 400 450 50010

−7

10−6

10−5

10−4

10−3

10−2

10−1

Frequency (MHz)

Err

or E

xpec

tatio

n

FPGAModel

(c) 8-digit

280 300 320 340 360 380 400 420 44010

−8

10−6

10−4

10−2

Frequency (MHz)

Err

or E

xpec

tatio

n

FPGAModel

(d) 12-digit

Figure 4: Expectation of overclocking error for on-line multipliers: verification of the proposed modelagainst Monte-Carlo simulations with timing as-sumptions (top row) and against FPGA results withreal timing information (bottom row).

represents the chain delay. For a given 𝑇𝑆 , the overall errorexpectation in (12) can also be obtained from (13).

𝐸𝑜𝑣𝑐 =∑𝑑>𝑏

𝑃𝑟(𝑑) ⋅ ∣𝜀(𝑑)∣ =∑𝑑>𝑏

𝐸(𝑑) (13)

From Figure 5 several observations can be made. First,the error magnitude decreases exponentially with longer chainlengths, because timing violation firstly affects LSDs withonline arithmetic. Second, interestingly we see that carrychains with longer delay would happen with greater proba-bilities in an OM. This is because for the unrolled radix-2OM, 𝑑(𝜏) is only dependent upon the inputs for chain gen-eration. Thus long chain will be generated as long as bothinput digits are not zeros with the probability of 4/9. Com-pared to the traditional arithmetic, carry generation, prop-agation and annihilation are all decided by specific inputpatterns. This limits the overall probability of long chains.However, we also notice that for chains with long delays,the increasing in likelihood of error is outweighed by the de-crease in magnitude of error. Therefore the combination ofboth results in a decline in error expectation. In contrast,one would expect error expectation to grow faster as theamount of overclocking is increased when using traditionalarithmetic because errors occur in MSBs.

4. CASE STUDY: IMAGE FILTER4.1 Experimental SetupThe benefits of the proposed methodology are demon-

strated by using a 3 × 3 Gaussian image filter, which con-tains 9 multipliers and 8 adders, with two types of com-puter arithmetic. One is the standard binary arithmeticwith data represented using 2’s complement form. In thiscase, speed-optimized adders and multipliers are created us-ing Xilinx Core Generator [13]. The other is the radix-2 on-line arithmetic, with the basic building blocks as described

5 5.5 6 6.5 710

−3

10−2

10−1

100

Chain Delay: d (digits)

Val

ue

ProbabilityError MagnitudeError Expectation

(a) 8-digit

5 6 7 8 910

−5

10−4

10−3

10−2

10−1

100


Val

ue


(b) 12-digit

5 6 7 8 9 10 1110

−6

10−5

10−4

10−3

10−2

10−1

100


Val

ue


(c) 16-digit

5 10 15 2010

−12

10−10

10−8

10−6

10−4

10−2

100


Val

ue


(d) 32-digit

Figure 5: Probabilities of different chain delay, thecorresponding magnitude of overclocking error andthe combination of both in terms of error expecta-tion in a radix-2 OM.

in Section 2.2 and Section 2.3. In this case, all inputs andoutputs are represented in the online format. In order toachieve the desired latency between input and output, bothdesigns are overclocked and the errors seen at the output arerecorded. The results are obtained from a Xilinx Virtex-6FPGA through post place-and-route simulations. The re-sults are evaluated in terms of mean relative error (MRE),which represents the percentage of error at outputs, as givenby (14) where 𝐸𝑒𝑟𝑟𝑜𝑟 and 𝐸𝑜𝑢𝑡 refer to the mean value of er-ror and correct output, respectively.

𝑀𝑅𝐸 =

∣∣∣∣𝐸𝑒𝑟𝑟𝑜𝑟

𝐸𝑜𝑢𝑡

∣∣∣∣× 100% (14)

In our experiments, two types of input data are utilized.One is randomly sampled from a uniform distribution of 𝑁 -digit numbers, as used in the models. This type is referredto as“Uniform Independent (UI) inputs”. The other is called“real inputs”, which are the pixel values of several 512× 512benchmark images.

4.2 Quantify the Impact of OverclockingThe MRE values of the image filter with traditional arith-

metic (dotted lines) and online arithmetic (solid lines) when𝑁 = 8 are illustrated in Figure 6. Note that we plot the“actual”maximum frequencies which are obtained by exper-imentally increasing the operating frequency from the ratedvalue until errors are observed at the outputs. This is to re-move the conservative timing margin for a fairer comparisononce overclocking is introduced.According to the timing report, the rated operating fre-

quencies of the two designs are 168.7MHz and 148.3MHz,respectively. However as seen in Figure 6, for the UI in-puts, design with online arithmetic actually operates at ahigher frequency without timing violations in comparisonto the design using traditional arithmetic. This error-freefrequency is even larger when using the real image data as

210 220 230 240 250 260 270 28010

−3

10−2

10−1

100

101

Frequency (MHz)

Mea

n R

elat

ive

Err

or (

%)

Online: UI DataOnline: LenaTraditional: UI DataTraditional: Lena

14% Speedup

4% Speedup

Figure 6: Overclocking error in an image filter withtwo types of computer arithmetic: online arithmeticand standard binary arithmetic, of which the ratedfrequencies are 148.3MHz and 168.7MHz, respec-tively, according to the timing analysis tool.

inputs, since the real data do not exactly follow the uniformdistribution or the independent assumption. In Figure 6 the“Lena” benchmark image is used as the real inputs.If errors is tolerable, we may allow timing violations to

happen for better performance. The sensitivity of overclock-ing for a given arithmetic can be evaluated by the data slopein Figure 6. For instance under an error budget of 1% MRE,using UI inputs the frequency speedup of the traditional de-sign is 3.89% with respect to the maximum frequency with-out errors, whereas the design with online arithmetic can beoverclocked by 6.85%. This indicates that online arithmeticis less sensitive to overclocking, as discussed in Section 3.3.The difference is greater using real image data: 13.74% fre-quency speedup with online arithmetic against 4.04% withtraditional arithmetic, because longer chains happen with asmaller probability with real inputs.The output images for both design scenarios are presented

in Figure 7. Since the overclocking errors are in the LSDsof the results with online arithmetic, the degradation on theimage can be hardly observed. In contrast, timing violationscause error in the MSDs with traditional arithmetic. Thisleads to severe quality loss as shown on the images in theright column of Figure 7. Meanwhile, errors in the MSDsresult in large noise power and therefore the signal-to-noiseratio (SNR) of the traditional design is small.We also perform experiments using other benchmark im-

ages. The results are summarized in Table 1 and Table 2in terms of the relative reduction of MRE and the improve-ments of SNR, respectively. In both tables the frequencyis normalized to the maximum error-free frequency for eacharithmetic. From Table 1, a significant reduction of MREcan be observed using online arithmetic. The geometricmean reduction of MRE is 89.2% using UI data. Even largerreductions of MRE can be achieved when testing with realimage data, varying from 97.3% to 98.2%, as expected giventhe results shown in Figure 6. Similarly from Table 2, theimprovements in SNR are 21.4𝑑𝐵 ∼ 43.9𝑑𝐵.The area comparison between two designs is illustrated

in Table 3. While we notice that our approach comes atsome increase in area, the FPGA architecture is optimizedfor conventional arithmetic. For instance, the Virtex seriesemploys dedicated multiplexers and encoders for very fastripple carry addition [12]. Besides, at least 2 orders of mag-

(a) 1.05𝑓0, SNR=38.3dB (b) 1.05𝑓0′, SNR:21.5dB

(c) 1.15𝑓0, SNR:36.2dB (d) 1.15𝑓0′, SNR:9.0dB

(e) 1.25𝑓0, SNR:26.7dB (f) 1.25𝑓0′, SNR:8.5dB

Figure 7: Output images of image filter using onlinearithmetic (left column) and traditional arithmetic(right column), where 𝑓0 and 𝑓0

′ denote the maxi-mum error-free frequencies for each design.

nitude error reduction is obtained for a given frequency inour design, as shown in Figure 6. Although the differencesin both error and area can be compensated by using moredigits with traditional arithmetic, this will result in an evenlarger gap in frequency between two designs.

Table 1: Relative Reduction of MRE with OnlineArithmetic for Various Normalized Frequencies.

InputsNormalized Frequency Geo.

Mean1.05 1.10 1.15 1.20 1.25

Uniform 94.5% 89.1% 90.1% 88.3% 84.3% 89.2%

Lena 99.3% 99.2% 98.9% 97.7% 94.9% 97.9%

Pepper 99.7% 98.3% 98.1% 97.2% 95.1% 97.7%

Sailboat 99.5% 97.9% 97.3% 96.8% 95.1% 97.3%

Tiffany 99.9% 97.6% 98.4% 97.8% 97.2% 98.2%

5. CONCLUSIONIn this paper, we have studied the probabilistic behavior of

key arithmetic primitives with different computer arithmeticwhen operating beyond the deterministic region. Throughthe usage of analytical models and empirical FPGA results,we have demonstrated that significant error reduction andperformance improvement can be achieved by using the“over-clocking friendly” online arithmetic.

6. ACKNOWLEDGMENTS

Table 2: Improvement of SNR (dB) with OnlineArithmetic for Various Normalized Frequencies.

InputsNormalized Frequency

1.05 1.10 1.15 1.20 1.25

Lena 44.6 36.3 33.2 29.1 22.9Pepper 35.7 28.3 28.7 25.9 24.1Sailboat 33.7 27.5 26.3 25.0 21.7Tiffany 43.9 25.9 29.5 25.6 24.5

Table 3: FPGA Resource Usage Comparison.

MetricArithmetic Type

OverheadTraditional Online

Look-Up Tables 912 1896 2.08Slices 324 525 1.62

This work is supported by the EPSRC (Grants EP/I020557/1and EP/I012036/1).

7. REFERENCES[1] T. Austin, D. Blaauw, T. Mudge, and K. Flautner.

Making typical silicon matter with razor. IEEE Trans.on Computer, 37(3):57–65, 2004.

[2] A. Avizienis. Signed-digit number representations forfast parallel arithmetic. IRE Trans. on ElectronicComputers, EC-10(3):389–400, 1961.

[3] M. Ercegovac and T. Lang. On-the-fly conversion ofredundant into conventional representations. IEEETrans. on Computer, C-36(7):895–897, 1987.

[4] M. D. Ercegovac. On-line arithmetic: An overview. InProc. Annual Technical Symp. Real time signalprocessing VII, pages 86–93, 1984.

[5] M. D. Ercegovac and T. Lang. Digital arithmetic.Morgan Kaufmann, 2003.

[6] R. Galli and A. Tenca. A design methodology fornetworks of online modules and its application to thelevinson-durbin algorithm. IEEE Trans. on VLSI,12(1):52–66, 2004.

[7] V. Gupta, D. Mohapatra, A. Raghunathan, andK. Roy. Low-power digital signal processing usingapproximate adders. IEEE Trans. on CAD ofIntegrated Circuits and Systems, 32(1):124–137, 2013.

[8] Y. Harata, Y. Nakamura, H. Nagase, M. Takigawa,and N. Takagi. A high-speed multiplier using aredundant binary adder tree. IEEE Jourl. ofSolid-State Circuits, 22(1):28–34, 1987.

[9] P. Kulkarni, P. Gupta, and M. Ercegovac. Tradingaccuracy for power with an underdesigned multiplierarchitecture. In Proc. Int. Conf. on VLSI Design,pages 346–351, 2011.

[10] K. Shi, D. Boland, and G. A. Constantinides.Accuracy-performance tradeoffs on an fpga throughoverclocking. In Proc. Int. Symp. Field-ProgrammableCustom Computing Machines, volume 0, pages 29–36,2013.

[11] K. S. Trivedi and M. Ercegovac. On-line algorithmsfor division and multiplication. IEEE Trans. onComputers, 26:161–167, 1975.

[12] Xilinx. Virtex-6 FPGA configurable logic block userguide, 2009.

[13] Xilinx. LogiCORE IP multiplier v11.2, 2011.

Documents

Datapath Synthesis for Overclocking: Online Arithmetic for ...cas.ee.ic.ac.uk/people/gac1/pubs/KanDAC14.pdf · N +x N y N y N Cin 1 Cin 2 z 2 + z 2-3 N-N+1 N+1-Binary Bits: Digits: