
An Overview of Floating-Point Support and Math Library on the Intel® XScale™ Architecture

Cristina Iordache and Ping Tak Peter Tang
Intel Corporation

[email protected], [email protected]

Abstract

New microprocessor architectures often require software support for basic arithmetic operations such as divide or square root. The Intel® XScale™ processor, designed for low-power mobile devices, provides no hardware support for floating-point. We show that an efficient software implementation of the basic operations and math library routines can achieve competitive performance, and effectively hide the lack of hardware floating-point for most applications.

1. Introduction

The Intel® XScale™ processor is a 32-bit RISC microarchitecture based on the architecture by Advanced RISC Machines (ARM*). Unlike processors used in general-purpose computing such as PCs or enterprise servers, the Intel® XScale™ processors target embedded platforms. These include high-end mobile telephones, PDAs, communicators, and wireless Web browsers.

Because of the large number of possible platforms, as well as the need for a software environment for product development on these processors, floating-point support for the XScale™ processor (which does not have an FP unit) is indispensable. Although floating-point support on integer-based processors is not new, the much-enhanced performance and instruction set of this new generation of integer processors bring the whole effort of floating-point support to a new level.

The purpose of this paper is to illustrate that, through careful utilization of the XScale™ microarchitecture and customized algorithms, floating-point performance on these low-power integer-based processors can rival that of PCs of just a few years ago. The outline of the paper is as follows. Section 2 discusses some of the important instructions we used in support of floating point. Section 3 discusses the emulation of the IEEE single and double precision basic operations. Section 4 discusses the support of a floating-point run-time library, with emphasis on the algorithmic methodology as well as the software architecture. Section 5 provides timing results of these floating-point components and some overall observations.

*Other brands and names are property of their respective owners.

2. The XScale™ Integer Instruction Set

The Intel® XScale™ processor complies with the ARM V5TE architecture specifications. It implements the integer instruction set of ARM Version 5, the Thumb instruction set (ARM V5T), and the ARM V5E DSP extensions.

The ARM integer instruction set includes 32-bit logical, arithmetic, and test instructions (see Table 1 for a list of the most important instructions, and [1] for more complete information). The destination register is specified explicitly in standard ARM mode, and thus does not need to be one of the operand registers. All of these data processing instructions have an option to update the status flags in the CPSR (Current Program Status Register) according to the result of the operation. For most instructions, execution can be made conditional on status flag values by adding an appropriate suffix. For example:

// r0 = r1 - r2, and update status flags:
subs  r0, r1, r2
// Set r0 = 2 if Carry flag = 1 and Zero flag = 0 (i.e. r1 > r2):
movhi r0, #2
// Set r0 = 1 if Zero flag = 1 (i.e. r1 == r2):
moveq r0, #1
// Set r0 = 0 if Carry flag = 0 (i.e. r1 < r2):
movcc r0, #0

Conditional execution helps improve performance by eliminating branches and branch penalties (where appropriate), and at the same time reduces code size.

Most register-to-register instructions take 1 cycle on XScale™. The most notable exceptions are the multiply and multiply-add instructions, which have latencies between 2 and 6 cycles, depending on the range of the multiplier, and a resource latency (throughput) only one cycle shorter than the instruction latency. Long multiply and multiply-add instructions that provide the full 64-bit product of 32-bit operands are helpful in the development of floating-point simulation code.
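In C terms, the long multiply corresponds to widening both operands before the multiplication; a minimal sketch of this primitive (our own helper, not code from the library):

#include <stdint.h>

/* C equivalent of UMULL: the full 64-bit product of two 32-bit
   operands, the primitive that the emulation code builds on. */
static inline void umull(uint32_t *hi, uint32_t *lo, uint32_t a, uint32_t b)
{
    uint64_t p = (uint64_t)a * b;   /* one UMULL on ARM V5 */
    *hi = (uint32_t)(p >> 32);
    *lo = (uint32_t)p;
}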

One distinctive feature of many ARM V5 instructions is the ability to apply a shift to the second operand, e.g.:

// r0 = r1 + (r2 >> 6):
add r0, r1, r2, LSR #6
// r0 = r1 - (r2 << r3):
sub r0, r1, r2, LSL r3

This eliminates the need for separate shift instructions and can save code space and register usage (if the second operand value is reused), and can improve overall performance.

A related feature is the ability to rotate the carry flag into the destination register, or one of the source register bits into the carry flag.

Saving and restoring registers to/from memory is relatively expensive and should be avoided, if possible: the cost of a save/restore pair grows with the number of registers saved, at a few cycles per register. The software conventions allow 5 scratch registers (r0, r1, r2, r3, r12), whose values can be modified by a called routine without saving them first. 15 general-purpose registers are visible at any one time; one of them (r13) is defined as the stack pointer.

Table 1. Main Types of ARM Arithmetic and Logical Instructions

Move                     Rd := Op2
Move NOT                 Rd := 0xffffffff EOR Op2
Add                      Rd := Rn + Op2
Add with carry           Rd := Rn + Op2 + Carry
Subtract                 Rd := Rn - Op2
Subtract with carry      Rd := Rn - Op2 + Carry - 1
Reverse SUB              Rd := Op2 - Rn
Reverse SUB w. carry     Rd := Op2 - Rn + Carry - 1
32 x 32 multiply         Rd := (Rm * Rs)[31:0]
32 x 32 mul.-add         Rd := (Rm * Rs + Rn)[31:0]
Long signed multiply     RdHi:RdLo := Rm * Rs (signed)
Long unsigned multiply   RdHi:RdLo := Rm * Rs (unsigned)
Long signed mul.-add     RdHi:RdLo := RdHi:RdLo + Rm * Rs (signed)
Long unsigned mul.-add   RdHi:RdLo := RdHi:RdLo + Rm * Rs (unsigned)
Count leading zeroes     Rd := count of leading zeroes in Rm
Logical AND              Rd := Rn AND Op2
Logical XOR              Rd := Rn EOR Op2
Logical OR               Rd := Rn ORR Op2
Logical Bit Clear        Rd := Rn AND NOT Op2
Test logical             update flags on Rn AND Op2
Test equivalence         update flags on Rn EOR Op2
Arithmetic compare       update flags on Rn - Op2
Negative compare         update flags on Rn + Op2

3. Basic FP Operations

The compiler generates a routine call for each basic floating-point operation (add, subtract, multiply, divide). The arguments and the result are passed in integer registers, as specified by the software conventions. For single precision, the operands are passed in r0 and r1, and the result is returned in r0. For double precision, the operands are passed in (r0, r1) and (r2, r3), and the result is returned in (r0, r1). These registers hold the memory representation format of the floating-point values used.

Typically, the argument values are unpacked, the significands and exponents are used in separate computations, and then combined to form the result at the end. Special cases (infinities, NaNs, denormals, overflow, underflow) are eliminated from the main path as quickly as possible, and processed separately. Our main goal was to speed up the normal cases.

3.1. The Add/Subtract Routines

The implementations of these simple operations take advantage of the special characteristics of the ARM architecture.

In all normal cases, the sign of the result is the sign of the argument larger in absolute value. The arguments are swapped if necessary, so that the first one (a) determines the sign of the result.

The single precision routine does not need to completely separate the argument exponents (which also helps by reducing the number of registers used). Given that only the lower 8 bits of a register specify a shift amount, the following values, obtained by rotating the initial arguments, can be used to correctly add the significands:

// rotate a: significand to the high bits, sign and exponent to the
// low byte (a arrives in r0 as s e1..e8 f1..f23):
mov   r0, r0, ROR #24
// r2 = (b ROR #24) - r0; the low bits hold the exponent difference:
rsb   r2, r0, r1, ROR #24
// test bit 8 of r2, to see if the signs are different:
tst   r2, #0x100
// change the sign of the significand, if bit 8 is set:
rsbne r0, r0, #0
// add the significands, the second aligned by shifting with the
// exponent difference (a register shift uses only the low 8 bits of r2):
add   r0, r1, r0, ASR r2

For rounding, the last significand bit (l) and the round bit (r) are tested by rotation into the carry flag, and the r = 0 and r = 1, l = 1 cases are eliminated early. The sticky bit is computed only for r = 1, l = 0.
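For illustration, the rounding decision can be written as follows (a C sketch; sig, r, and s are our names for the truncated significand, round bit, and sticky bit):

#include <stdint.h>

/* Round-to-nearest-even from the bits described above: round up
   exactly when r = 1 and either the sticky bit or the last kept
   bit is set; the r = 0 and r = 1, l = 1 cases need no sticky. */
static uint32_t round_nearest_even(uint32_t sig, int r, int s)
{
    if (r && (s || (sig & 1)))
        sig += 1;               /* caller renormalizes on carry-out */
    return sig;
}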

The double precision routine uses similar techniques. Given that the significands cannot fit in just one 32-bit register, further speedup is obtained by selecting different code paths based on the operation (add or subtract) and the exponent difference (the shift amount for the second significand).

3.2. Multiply

The single precision multiply routine is based on the long 32 x 32 -> 64 multiply instruction (UMULL) and is relatively straightforward. The input significands are scaled so that their product falls in the range $[2^{46}, 2^{48})$. The output significand and the round bit are then obtained by shifting right the upper half (upper register) of the product. The sticky bit is the OR sum of the remaining bits.

The double precision multiply routine uses the long multiply and multiply-add instructions (UMULL, UMLAL) to get the full 106-bit product.
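Returning to the single precision path, its significand handling can be sketched in C as follows (the scaling and names are ours, not the library's):

#include <stdint.h>

/* a and b are 24-bit significands in [2^23, 2^24), so the full
   product p = a*b (one UMULL) lies in [2^46, 2^48). */
static uint32_t mul_significand(uint32_t a, uint32_t b,
                                int *round, int *sticky)
{
    uint64_t p = (uint64_t)a * b;
    int shift = (p >> 47) ? 24 : 23;        /* renormalize the product   */
    uint32_t sig = (uint32_t)(p >> shift);  /* 24-bit result significand */
    *round  = (int)((p >> (shift - 1)) & 1);
    *sticky = (p & (((uint64_t)1 << (shift - 1)) - 1)) != 0;
    return sig;   /* exponent adjustment for the 2^47 case is omitted */
}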

3.3. Divide

3.3.1. Single Precision Divide. The single precision divide algorithm uses an 8-bits-in, 8-bits-out reciprocal lookup table, and computes the leading 25 bits of the quotient in 5 iterations, followed by one simple correction.

As usual, the sign and exponent are computed separately, and the significand of the result is obtained by rounding the quotient $a/b$ of the scaled argument significands, where $a = 1.a_1 a_2 \ldots a_{23}$ and $b = 1.b_1 b_2 \ldots b_{23}$.

The reciprocal approximation $R$, read from the table using the leading divisor bits $b_1 b_2 \ldots b_8$, satisfies $|1 - R\,b| < 2^{-8}$. The quotient is accumulated as a sum of signed digits,

$$q = g_0 + g_1\,2^{-5} + g_2\,2^{-10} + \cdots + g_5\,2^{-25}.$$

Let $g_0 = \mathrm{round}(R\,a)$ and $r_0 = a - g_0\,b$, and let us assume that after each iteration we have $|r_j| < 2^{-5j+1}$. (This is true for $r_0$.) Each iteration derives a new digit from the current remainder,

$$g_{j+1} = \mathrm{round}(R\,r_j\,2^{5(j+1)}),$$

an integer with $|g_{j+1}| \le 2^6$, and updates the remainder:

$$r_{j+1} = r_j - g_{j+1}\,2^{-5(j+1)}\,b.$$

Then

$$r_{j+1}\,2^{5(j+1)} = (r_j\,2^{5(j+1)})(1 - R\,b) + (R\,r_j\,2^{5(j+1)} - g_{j+1})\,b,$$

and since $|1 - R\,b| < 2^{-8}$ and the digit is off by at most one half, $|r_{j+1}| < 2^{-5(j+1)+1}$. The iteration, repeated 5 times, therefore computes the quotient to about 25 bits of accuracy.

Before rounding, one more step is needed to ensure that the leading 25 bits of the quotient are all correct: if $q\,b > a$, then $q$ is decremented by one unit in its last place. The exact round bit is bit 25 of the quotient. For normal cases, round-to-nearest can be performed correctly based on the round bit value only, given that the quotient cannot fall at a midpoint between two floating-point values of the same precision as the arguments.
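The structure of the iteration can be sketched in C. The table parameters and scalings below are our own choices for illustration, and the single final correction is replaced by a loop so that the sketch stays safe for any table of this form:

#include <stdint.h>

/* Compute q ~ floor((a << 25) / b) for significands a, b in
   [2^23, 2^24), via a short reciprocal and repeated correction. */
static uint32_t div_significand(uint32_t a, uint32_t b)
{
    /* 8-bits-in, 8-bits-out reciprocal; a real implementation reads
       a 256-entry table indexed by bits b1..b8 of the divisor.
       Dividing by (b_top + 1) keeps R an underestimate of 2^31 / b,
       so every digit g below underestimates the true quotient. */
    uint32_t b_top = b >> 15;               /* leading 9 bits: 256..511 */
    uint32_t R = (1u << 16) / (b_top + 1);  /* 8 significant bits */

    uint64_t rem = (uint64_t)a << 25;
    uint64_t q = 0;
    for (int j = 0; j < 5; j++) {           /* ~6 quotient bits per pass */
        uint64_t g = (rem * R) >> 31;       /* g <= floor(rem / b) */
        q   += g;
        rem -= g * b;
    }
    while (rem >= b) { q++; rem -= b; }     /* final correction */
    return (uint32_t)q;                     /* 25+ bits incl. round bit */
}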

3.3.2. Double Precision Divide. The double precision divide algorithm computes $1/b$ to over 12 bits of accuracy with a bipartite scheme using a 7-bits-in, 16-bits-out table and a 7-bits-in, 8-bits-out table. 5 iterations followed by a simple correction are used to compute the first 55 significant bits of the quotient. Bipartite table reciprocal computation schemes were first proposed in [2], [3], [4].

Let $b = 1.b_1 b_2 \ldots b_{52}$ be the significand of the divisor, and let $a = 1.a_1 a_2 \ldots a_{52}$ be that of the dividend; as integers scaled by $2^{52}$, each occupies two 32-bit registers.

A scaled reciprocal approximation is computed as $R = R_1 + R_2$, where $R_1$ is read from the 16-bits-out table indexed by the leading divisor bits $b_1 b_2 \ldots b_7$, and $R_2$ is read from the 8-bits-out table indexed by the next seven bits, $b_8 b_9 \ldots b_{14}$. The sum satisfies $|1 - R\,b| < 2^{-12}$.

The quotient significand is computed as a sum of digits,

$$q = g_0 + g_1\,2^{-11} + g_2\,2^{-22} + \cdots + g_5\,2^{-55},$$

in a manner similar to our single precision algorithm. Each iteration derives a new digit from the current remainder, $g_{j+1} = \mathrm{round}(R\,r_j\,2^{11(j+1)})$, and updates the remainder, $r_{j+1} = r_j - g_{j+1}\,2^{-11(j+1)}\,b$. As in the single precision analysis, $|1 - R\,b| < 2^{-12}$ implies that $|r_j| < 2^{-11j+1}$ after each iteration, so every pass contributes 11 correct quotient bits.

The scaled remainder is 76 bits long and requires 3 32-bit registers, but $g_{j+1}$ can be obtained directly (with no additional shifting) from the register that holds its leading 12 bits. The remainder for the next iteration is obtained by shifting the remaining 2 registers 11 positions to the left and subtracting $g_{j+1}\,b$ with the long multiply-add instructions.

After the fifth iteration, a final correction ensures that the leading 55 bits of the quotient are all correct: if $q\,b > a$, then $q$ is decremented by one unit in its last place.

Rounding to nearest is performed based on the value of the round bit (bit 54). The sticky bit is not needed, since it is always 1, with the exception of exact cases.

3.3.3. 32-bit Integer Divide. Integer divide is implemented using a 6-bits-in, 8-bits-out table that provides a short reciprocal approximation for the divisor. The quotient is computed iteratively (about 6 bits per iteration). A special path is used for short quotients, which are computed 1 bit at a time.

In this case as well, we compute the quotient as a sum of digits, $q = g_1 + g_2 + \cdots + g_5$, derived from a reciprocal approximation $R \approx 1/B$. $R$ is obtained by scaling a table value that approximates the reciprocal of the leading 7 bits of $B$. Given that the table is accurate to at least 5.75 bits, we have

$$|1 - R\,B| \le 2^{-5.75}.$$

Each digit underestimates the quotient of the current remainder, and a long multiply-add is used to get the next remainder in each iteration:

$$g_{j+1} = \lfloor R\,r_j \rfloor, \qquad r_{j+1} = r_j - g_{j+1}\,B, \qquad q_{j+1} = q_j + g_{j+1},$$

with $r_0 = A$ and $q_0 = 0$. Each iteration reduces the remainder by close to 6 bits, and one can show that after the fifth iteration $0 \le r_5 < 2B$; a similar argument applies on the special path for short quotients, so in this case as well, 5 iterations are sufficient to perform a 32-bit integer divide.

After the fifth iteration, a final correction is applied: if $r_5 \ge B$, then $q$ is incremented by 1.
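A C sketch of the same scheme for 32-bit unsigned operands (again with our own table parameters, a computed "table", and a correction loop standing in for the single conditional correction and the short-quotient path):

#include <stdint.h>

static uint32_t udiv32(uint32_t a, uint32_t b)   /* b != 0 assumed */
{
    int lz = __builtin_clz(b);            /* GCC builtin; CLZ on XScale */
    uint32_t bn = b << lz;                /* normalized divisor, bit 31 set */
    /* 6-bits-in, 8-bits-out reciprocal of the leading 7 divisor bits;
       the +1 keeps R, hence each digit, an underestimate. */
    uint32_t R = (1u << 14) / ((bn >> 25) + 1);

    uint64_t rem = a, q = 0;
    for (int j = 0; j < 6; j++) {         /* ~5-6 quotient bits per pass */
        uint64_t g = (rem * R) >> (39 - lz);
        q   += g;                         /* quotient update */
        rem -= g * b;                     /* long multiply-add update */
    }
    while (rem >= b) { q++; rem -= b; }   /* final correction */
    return (uint32_t)q;
}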

4. Transcendental Functions

In theory, once the basic floating-point operations are emulated, any high-level language implementation of a floating-point run-time library based on a floating-point type can be used to provide the functionality required. However, such an approach has several major drawbacks. First, the performance will be low. Consider for example an exponential function calculation based on a floating-point type. Every basic operation is translated into a function call. Moreover, a good portion of the work done in each emulated basic floating-point operation is redundant or unnecessary: the floating-point encoding is unpacked and packed at each operation, and most of the effort to maintain IEEE-compliant rounding in the intermediate calculations is unnecessary. Second, and less obvious, is that the machine's capability is not fully utilized. The native 32-bit signed integer can carry 31 significant bits of accuracy while the IEEE single precision format only offers 24. Similarly, a 64-bit signed integer carries 63 significant bits of accuracy while IEEE double precision only carries 53. Because of the high potential performance advantage, a set of integer-based implementations of the floating-point transcendental functions was developed.

4.1. Methodology

The natural accuracy characteristic of floating-point arithmetic is relative accuracy, while that of integer arithmetic is absolute accuracy. A fundamental question about using integer arithmetic to implement a floating-point function is whether absolute accuracy suffices for a core part of the calculation. When absolute accuracy suffices, integer computation can be applied in a most natural and efficient manner. In the computation of an elementary transcendental function, one important step is often the computation of the function near a special point such as a root. That is, we need to approximate the underlying transcendental function by computing a function $f$ of the form

$$f(x) = p_0 + p_1 x + p_2 x^2 + \cdots + p_n x^n.$$

The right hand side can be expressed as

$$f(x) = T + 2^N p(x).$$

Here $T$ is generally a leading portion of $f(x)$, such as the first term or two of the polynomial, and $2^N$ is no bigger than the leading exponent of the function value. With appropriate choices of $T$ and $N$, $p(x)$ needs only to be calculated to a prescribed absolute accuracy in order for the final expression to carry enough relative accuracy. This is because the magnitude of $T$ is usually comparable to that of $f$, and $2^N$ appears as a scale factor on $p$, scaling any absolute error there into a relative error with respect to $f$. This kind of decomposition is made in each transcendental function we implemented, and we describe the $p(x)$ functions as our absolute accuracy core. The decomposition is by no means unique, and can be traded off between performance and accuracy. Some specific forms of $f(x)$ are $e^x - 1$, $\log(1+x)$, and $\sin(x)/x$.

Closely related to this absolute accuracy core is a fixed-point computation of a polynomial. This is best illustrated by the simple case of computing $p(x)$ via Horner's recurrence:

$$p = p_0 + x\,(p_1 + x\,(p_2 + \cdots + x\,p_n)\cdots).$$

Suppose the variable $x$ and the coefficients are stored in 32-bit integers

$$X = x\,2^{32}, \qquad P_j = p_j\,2^{31}$$

for all $j$ (possible because the reduced argument satisfies $|x| < 1/2$). Then the following integer computation yields the desired expression with an absolute accuracy comparable to $2^{-31}$:

y := P_n
for j = n-1, n-2, ..., 0 do
    y := high 32 bits of (X * y)
    y := y + P_j
end do
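In C, assuming the Q-format scalings just described (a sketch, not the library code):

#include <stdint.h>

/* Fixed-point Horner: X = x * 2^32 (|x| < 1/2), P[j] = p_j * 2^31.
   The >> 32 implements "high 32 bits of the product" and keeps the
   2^31 scaling, so the result is p(x) * 2^31 to within ~1 ulp/step.
   (Arithmetic right shift of negative values assumed, as on ARM.) */
static int32_t horner_fix(int32_t X, const int32_t *P, int n)
{
    int32_t y = P[n];
    for (int j = n - 1; j >= 0; j--) {
        y = (int32_t)(((int64_t)X * y) >> 32);   /* high word */
        y += P[j];
    }
    return y;
}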

While the computation of the absolute accuracy core in this native integer computation sequence is an important part, there are other general computations where relative accuracy has to be maintained. In general, data items are represented as scaled integers: an integer value $I$ corresponding to the significand is kept, and a scale factor $s$ is carried either explicitly or implicitly. The convention we adopt is that $(I, s)$ represents the value $I\,2^{s}$. When two values $(I_1, s_1)$, $(I_2, s_2)$ are operated on, a resulting value $(I_3, s_3)$ is obtained. For example, in the case of multiplication of 32-bit $I$'s, the high part of the 64-bit product is retained and the resulting value is $(\mathrm{high}(I_1 I_2),\ s_1 + s_2 + 32)$. This method more or less preserves 30 bits of relative accuracy.
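A sketch of the scaled-integer multiply in C (the struct is ours; the library carries the scale factor implicitly where it can):

#include <stdint.h>

typedef struct { uint32_t sig; int exp; } scaled;  /* value = sig * 2^exp */

/* Keep the top 32 bits of the 64-bit product and fold the discarded
   bits into the exponent; with normalized inputs (bit 31 set) the
   result stays normalized and ~30-31 result bits are preserved. */
static scaled scaled_mul(scaled u, scaled v)
{
    uint64_t p = (uint64_t)u.sig * v.sig;   /* one UMULL */
    int shift = (p >> 63) ? 32 : 31;        /* product in [2^62, 2^64) */
    scaled r = { (uint32_t)(p >> shift), u.exp + v.exp + shift };
    return r;
}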

Finally, the algorithms employed are the well known table-driven methods [5]. This class of methods typically separates the function computation process into three stages: argument reduction, core approximation, and final reconstruction.

Argument reduction is performed mostly by integer arithmetic operating on the significand of the input argument. For example, in the calculation of $e^x$ we compute

$$n = \mathrm{round}(x \cdot 8/\ln 2), \qquad r = x - n\,(\ln 2)/8$$

from the significand $X$ of the input value and a 31-bit approximation $L$ to $8/\ln 2$, suitably scaled up by a power of 2. The resulting quantities $n$ and $r$ satisfy

$$x = n\,(\ln 2)/8 + r, \qquad |r| \le (\ln 2)/16.$$

Hence $e^x = 2^{n/8}\,e^r$, where $r$ is the reduced argument. For single precision calculation, the table size is usually chosen so that the reduced argument $r$ is small enough that $r\,2^{31+k}$ fits in a 32-bit signed integer for some $k \ge 1$. This allows easy integer computation of the absolute accuracy core.

The core approximation consists of a simple sequence of integer computations followed by general scaled integer computations (that is, computation of the significands together with explicit computation of scale factors). The final reconstruction usually requires general scaled integer computations.

4.2. Implementation

The performance advantage of an integer-based implementation of the transcendental function library over one based on software-emulated basic floating-point operations is quite obvious. However, the implementation effort required can vary depending on the strategy employed.

We decided not to use assembly coding because of the labor intensity of both the initial library creation and its subsequent maintenance. Moreover, a hierarchical method is used in that most functions share some basic routines. For example, natural, base-2, and base-10 logarithms use the same basic underlying routine; inverse sine and inverse cosine make use of the inverse tangent routine. However, to maximize performance, we do not use common routines in the standard sense. Our common "routines" are really macros expanded into common code sequences. Thus, the library is modular and call-overhead free within each function. The tradeoff is an increased code size, which is deemed acceptable as the code size of a transcendental function library is not big to begin with.

Here is an example of the single-precision natural logarithm function:

sgl_cvt_flt2fix(y, sc_y, x);
sgl_log(n, z, sc_z, y, sc_y);
... ... ... ...
sgl_cvt_fix2flt(y, w, sc_w);

The first macro unpacks an IEEE input x into a scaled integer representation: the integers y and sc_y are such that y · 2^sc_y is the input value. This macro is used at the very beginning of almost every function. The macro sgl_log computes the natural logarithm of an input value represented in scaled integer form and returns the result in three integers n, z, and sc_z, where

log(y · 2^sc_y) = n · log 2 + z · 2^sc_z.

Some native integer code then computes the value on the right hand side in scaled integer format. Finally, a packing macro sgl_cvt_fix2flt is used to convert the scaled integer value into an IEEE encoding. Clearly, this packing macro is used extensively.

The macro sgl_log is reused for base-2 and base-10 logarithms; we only need to change the calculation following it to yield the base-2 or base-10 logarithm, respectively. Our experience is that modularity and reuse greatly enhance our efficiency in the overall software development process.

4.3. Example: The Single Precision Exponential Function

The computation of the single precision exponential function is based on the mathematical identity

$$e^x = 2^m \cdot 2^{j/8} \cdot e^r,$$

where $m$ and $j$ are integers, $0 \le j \le 7$, and $|r| \le (\ln 2)/16$. Thus

$$e^x = 2^m \cdot 2^{j/8} \cdot (1 + p(r)),$$

where $p(r)$ is a polynomial that approximates $e^r - 1$. The 8 possible values of $2^{j/8}$ are computed beforehand and stored in a table. We now outline the key implementation details involved.

The input value $x$ is unpacked and represented by a scaled integer $X$. Because the exponential function has a rather limited range of input (as it overflows and underflows easily), $x$ is quite limited in range.

The constant $8/\ln 2$ is stored as a 32-bit integer $L_1$ with another 8-bit extension $L_2$, so that $L_1 + L_2$ carries about 40 significant bits, suitably scaled. Using the 32-bit long integer multiplication instruction we compute $X \cdot L_1$ and obtain $m$, $j$ and $r$ such that

$$x \approx (m + j/8)\,\ln 2 + r, \qquad |r| \le (\ln 2)/16.$$

Moreover, we multiply just a few leading bits of $X$ by $L_2$ to obtain a correction $c$ to $r$, so that $(m + j/8)\,\ln 2 + (r + c)$ approximates $x$ accurately. We modify $r$ to $r + c$.

Next, a 5-coefficient polynomial is evaluated at $r$ using fixed-point computation: set $y := c_4$, and for $k = 3, 2, 1, 0$ compute $y := \mathrm{hi}(R \cdot y) + c_k$, where $R$ is the scaled reduced argument and $\mathrm{hi}$ denotes the high 32 bits of a product. Each of the $c_k$ carries a scale factor of 31, and thus in the end

$$y \approx p(r)\,2^{31} = (e^r - 1)\,2^{31}.$$

Finally, $T$ is fetched from a table where $T \approx 2^{j/8}$, suitably scaled, and the value $z = T + \mathrm{hi}(T \cdot y)$, which carries $2^{j/8}\,e^r = 2^{j/8}(1 + p(r))$ in scaled form, is computed. Hence $e^x \approx z\,2^m$ up to the scale factor; $z$ and $m$ are (rounded and) packed into IEEE single precision format.
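The following C sketch puts the pieces together for the single precision exponential. The Q-format choices and helper names are ours, and libm calls are used to build the $2^{j/8}$ table and pack the final result purely to keep the sketch short; the real routine works from an IEEE input with its own precomputed table and integer packing throughout:

#include <stdint.h>
#include <math.h>

/* e^x for x given as a Q24 fixed-point integer X (x = X * 2^-24).
   Scheme: x = (m + j/8) ln2 + r, e^x = 2^m * 2^(j/8) * (1 + p(r)).
   Arithmetic right shift of negative values assumed, as on ARM. */
static double exp_sketch(int32_t X)
{
    const int32_t L = 48408813;    /* round((8/ln2) * 2^22)  */
    const int32_t K = 186065279;   /* round((ln2/8) * 2^31)  */

    /* n = round(x * 8/ln2); X*L carries x * (8/ln2) * 2^46 */
    int32_t n = (int32_t)(((int64_t)X * L + ((int64_t)1 << 45)) >> 46);
    int32_t j = n & 7;
    int32_t m = (n - j) / 8;                   /* floor(n / 8), exact */
    /* reduced argument r = x - n*(ln2/8) in Q31, |r| <= ln2/16 */
    int32_t r = (int32_t)((int64_t)X * 128 - (int64_t)n * K);

    /* e^r by a 5-coefficient polynomial: 1, 1, 1/2, 1/6, 1/24 in Q30 */
    static const int32_t c[5] =
        { 1073741824, 1073741824, 536870912, 178956971, 44739243 };
    int32_t y = c[4];
    for (int k = 3; k >= 0; k--)               /* fixed-point Horner */
        y = (int32_t)(((int64_t)r * y) >> 31) + c[k];

    /* 2^(j/8) in Q30; built with libm here for brevity only */
    uint32_t T = (uint32_t)llround(ldexp(exp2(j / 8.0), 30));
    uint64_t z = ((uint64_t)T * (uint32_t)y) >> 30;  /* 2^(j/8)*e^r, Q30 */

    return ldexp((double)z, m - 30);  /* real code packs IEEE directly */
}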

4.4. Example: The Double Precision Natural Logarithm

The computation of the double precision log is based on the mathematical identity

$$\log(x \cdot c) = \log(x) + \log(c).$$

In the following, we denote the unbiased exponent of the double precision input by $e$, and the mantissa by $m = 1.m_1 m_2 \ldots m_{52}$.

For arguments sufficiently far from 1.0, the identity above is applied twice in our computation:

$$\log(m\,2^e) = e \log 2 + \log(m\,c_1\,c_2) - \log(c_1) - \log(c_2).$$

$c_1$ and $c_2$ are selected such that $m\,c_1\,c_2$ is very close to 1, so that $\log(m\,c_1\,c_2) = \log(1+t)$ can be approximated by a short polynomial in the argument $t = m\,c_1\,c_2 - 1$.

In particular, for $m \in [1, 2)$, we select $c_1$ as a 6-bit approximation to the reciprocal of $1.m_1 m_2 \ldots m_5$; it can be accessed from a table indexed by $m_1 m_2 \ldots m_5$, the first 5 mantissa bits after the leading 1. We choose to store $\log(c_1)$, accurate to at least 70 bits, in scaled integer form in a table using the same index.

Let $m\,c_1 = 1 + t_1$. We select $c_2 = 1 - t_2$, where $t_2$ is $t_1$ rounded to its leading bits. $c_2$, as well as $\log(c_2)$, can be read from a table indexed by the leading bits of $t_1$; $\log(c_2)$ is stored to an accuracy of at least 66 bits.

For $t = m\,c_1\,c_2 - 1$ we now have $|t| < 2^{-11}$, so $\log(1+t)$ can be estimated to at least 61 bits of accuracy by a degree 6 polynomial. This is sufficient to ensure that the maximum error of the final result is less than 0.55 ulp. The high order terms of the polynomial can be safely estimated with $32 \times 32$-bit multiplies, while $64 \times 64$-bit multiplies are used where accuracy is critical.

The sum of the polynomial terms is then added to the two table values and to the $e \log 2$ contribution of the input exponent, all carried in scaled integer form with a common scale factor. The result is then converted from scaled integer to floating point format.

For arguments near 1, only a polynomial evaluation as described above is used to get the final result.

5. Timings and Conclusion

Our basic single precision operation implementations (add, subtract, multiply, divide) are about 10 times faster than the corresponding GNU implementations, and comparable to the ARM v2.5 implementations, which are also highly optimized. Our double precision basic operations, however, are 2 to 3 times faster than the ARM implementations, and 10 to 20 times faster than the GNU implementations.

Our square root assembly implementation is about 15 times faster than the GNU implementation. In double precision, our sqrt is about 3 times faster than the ARM sqrt. In single precision, our routine is more than 7 times faster (also due to the fact that ARM does not provide specialized single precision math functions, so the single precision result is based on a call to the double precision function).

Since all other math functions are implemented in C, their performance is compiler dependent. The timings provided in this section were obtained for a library built with the Intel Saturn compiler. The compiler-generated code is not optimal, but the Saturn-generated code for these functions is generally better than ARM- or GNU-generated code. An assembly implementation of expf runs in 108 cycles (worst case), while the Saturn-generated code for the same algorithm takes about 167 cycles (relatively close). In double precision, the current Saturn-generated code for exp takes 525 cycles, while less than 250 cycles suffice for an assembly implementation of the same algorithm. Most single precision routines run in 100 to 250 cycles when the Intel Saturn compiler is used to build the library. Double precision routines are typically 3 to 4 times slower than their single precision counterparts. We are still working with the Saturn compiler team to improve the performance of their code. All of our math library routines have maximum ulp errors below 0.55.

We found that when used with the Saturn compiler basic numerics library, the performance of the GNU libm (which calls the basic FP op routines) improved 8 to 10 times. Our integer-based libm implementation is still at least 4 times faster.

Table 2. Table of latencies

Function    Latency [cycles]     Latency [cycles]
            Single Precision     Double Precision
Add               34                   59
Multiply          35                   46
Divide            57                  118
SQRT              84                  183
ASIN             250                  520
EXP              167                  525
LOG              215                  433
SIN              181                  607

In conclusion, given the relatively low latencies provided by software floating-point support, applications that are not floating-point intensive can achieve performance as if floating-point were natively supported in hardware.

References

[1] ARM Developer Suite, Assembler Guide, ARM Limited, 2000.

[2] D. DasSarma and D. W. Matula, "Finite Precision Reciprocal Computation: I. Bipartite Tables" (expanded version of "Faithful Bipartite ROM Reciprocal Tables"), Proc. 12th IEEE Symp. Comput. Arithmetic, 1995, pp. 17-28.

[3] H. Hassler and N. Takagi, "Function Evaluation by Table Look-up and Addition", Proc. 12th IEEE Symp. Comput. Arithmetic, 1995, pp. 10-16.

[4] N. Takagi, "Generating a Power of an Operand by a Table Look-up and a Multiplication", Proc. 13th IEEE Symp. Comput. Arithmetic, 1997, pp. 126-131.

[5] Ping Tak Peter Tang, "Table-driven Implementation of the Logarithm Function in IEEE Floating-point Arithmetic", ACM Transactions on Mathematical Software, vol. 16, no. 4, December 1990, pp. 378-400.
