CS241 Computer Organization Spring 2014 Introduction to Floating-Point 1-29-2015


Page 1: CS241 Computer Organization Spring 2014

CS241 Computer Organization

Spring 2014

Introduction to Floating-Point

1-29-2015

Page 2: CS241 Computer Organization Spring 2014

Outline

⬛ Review HW#2 (quiz today)
⬛ Signed & unsigned addition
⬛ When to use unsigned variables
⬛ Representing floating point numbers

HW#3 due Tuesday, 2/03; please write answers on sheet

Quiz on 2s-complement & float next Thursday, 2/06

Lab#1 Datalab, includes isNonNegative(x)
▪ can be done in 6 ops or fewer

Page 3: CS241 Computer Organization Spring 2014

Carnegie Mellon

Example: fun with bytes

/* get most significant byte from x */
int get_msb(int x) {
    /* compute w-8 */
    int shift_val = (sizeof(int)-1)<<3;
    /* arithmetic shift by w-8 */
    int xright = x >> shift_val;
    /* zero all but LSB */
    return xright & 0xFF;
}

Page 4: CS241 Computer Organization Spring 2014

Summary Casting Signed ↔ Unsigned: Basic Rules

⬛ Bit pattern is maintained
⬛ But reinterpreted
⬛ Can have unexpected effects: adding or subtracting 2^w
⬛ Expression containing signed and unsigned int
▪ int is cast to unsigned!!
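The casting rules can be tried out with a minimal sketch like the one below. The helper names `as_unsigned`, `mixed_less`, and `signed_less` are hypothetical and only serve the illustration.

```c
/* Reinterpret a signed int as unsigned. C defines this conversion as
   reduction modulo 2^w, which keeps the two's-complement bit pattern. */
unsigned as_unsigned(int x) {
    return (unsigned)x;
}

/* Mixed comparison: a is converted to unsigned before comparing,
   so a negative a behaves like a very large value. */
int mixed_less(int a, unsigned b) {
    return a < b;
}

/* The same comparison with both operands signed. */
int signed_less(int a, int b) {
    return a < b;
}
```

With these helpers, `mixed_less(-1, 1u)` is false even though `signed_less(-1, 1)` is true, because -1 is first converted to the largest unsigned value.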

Page 5: CS241 Computer Organization Spring 2014

Unsigned Addition

⬛ Standard Addition Function
▪ Ignores carry output

⬛ Implements Modular Arithmetic
s = UAdd_w(u, v) = (u + v) mod 2^w

$$
\mathrm{UAdd}_w(u,v) =
\begin{cases}
u + v, & u + v < 2^w \\
u + v - 2^w, & u + v \ge 2^w
\end{cases}
$$

[Figure: operands u and v (w bits each); true sum u + v (w+1 bits); discard carry to get UAdd_w(u, v) (w bits)]
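As a rough sketch of this definition, the code below fixes w = 8 so the wraparound is easy to see. The names `uadd8` and `uadd8_overflows` are made up for this example.

```c
#include <stdint.h>

/* UAdd_w for w = 8: the true sum may need 9 bits; keep the low 8. */
uint8_t uadd8(uint8_t u, uint8_t v) {
    unsigned true_sum = (unsigned)u + (unsigned)v;   /* at most 510 */
    return (uint8_t)(true_sum & 0xFFu);              /* true_sum mod 256 */
}

/* The sum wrapped exactly when the result is smaller than an operand. */
int uadd8_overflows(uint8_t u, uint8_t v) {
    return uadd8(u, v) < u;
}
```

For instance, 200 + 100 has true sum 300, and discarding the carry leaves 300 - 256 = 44.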

Page 6: CS241 Computer Organization Spring 2014

Visualizing (Mathematical) Integer Addition

⬛ Integer Addition
▪ 4-bit integers u, v
▪ Compute true sum Add4(u, v)
▪ Values increase linearly with u and v
▪ Forms planar surface

[Figure: surface plot of Add4(u, v) over axes u and v]

Page 7: CS241 Computer Organization Spring 2014

Visualizing Unsigned Addition

⬛ Wraps Around
▪ If true sum ≥ 2^w
▪ At most once

[Figure: true sum (0 to 2^(w+1)) mapped onto modular sum (0 to 2^w); surface plot of UAdd4(u, v) over axes u and v, with the overflow region marked]

Page 8: CS241 Computer Organization Spring 2014

Two’s Complement Addition

⬛ TAdd and UAdd have Identical Bit-Level Behavior
▪ Signed vs. unsigned addition in C:
    int s, t, u, v;
    s = (int) ((unsigned) u + (unsigned) v);
    t = u + v;
▪ Will give s == t

[Figure: operands u and v (w bits each); true sum u + v (w+1 bits); discard carry to get TAdd_w(u, v) (w bits)]
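One way to see the shared bit-level behavior is to do the addition in unsigned arithmetic and only reinterpret the result at the end. The sketch below assumes w = 8, and its function names are illustrative. It avoids signed overflow on purpose, since the C standard leaves that undefined even though typical two's-complement machines behave as the slide describes.

```c
#include <stdint.h>

/* Interpret an 8-bit pattern as a two's-complement value. */
int bits_to_signed8(uint8_t bits) {
    return bits < 128 ? bits : bits - 256;
}

/* UAdd_8: add and discard the carry out of bit 7. */
uint8_t uadd8(uint8_t u, uint8_t v) {
    return (uint8_t)(u + v);
}

/* TAdd_8: the same bit-level addition, read back as signed. */
int tadd8(int8_t u, int8_t v) {
    return bits_to_signed8(uadd8((uint8_t)u, (uint8_t)v));
}
```

Here `tadd8(100, 100)` and `uadd8(100, 100)` produce the same bit pattern, 11001000, which reads as -56 signed and 200 unsigned.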

Page 9: CS241 Computer Organization Spring 2014

TAdd Overflow

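⬛ Functionality
▪ True sum requires w+1 bits
▪ Drop off MSB
▪ Treat remaining bits as 2’s comp. integer

[Figure: true sum (w+1 bits) mapped to TAdd result (w bits)]
▪ True sum from 0 100…0 (2^(w–1)) up to 0 111…1 (2^w – 1): PosOver
▪ True sum 0 000…0 (0): no overflow
▪ True sum from 1 000…0 (–2^w) up to 1 011…1 (–2^(w–1) – 1): NegOver
▪ TAdd result ranges over 100…0, 000…0, 011…1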

Page 10: CS241 Computer Organization Spring 2014

Visualizing 2’s Complement Addition

⬛ Values
▪ 4-bit two’s comp.
▪ Range from -8 to +7

⬛ Wraps Around
▪ If sum ≥ 2^(w–1)
  ▪ Becomes negative
  ▪ At most once
▪ If sum < –2^(w–1)
  ▪ Becomes positive
  ▪ At most once

[Figure: surface plot of TAdd4(u, v) over axes u and v, with PosOver and NegOver regions marked]

Page 11: CS241 Computer Organization Spring 2014

Characterizing TAdd

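⬛ Functionality
▪ True sum requires w+1 bits
▪ Drop off MSB
▪ Treat remaining bits as 2’s comp. integer

$$
\mathrm{TAdd}_w(u,v) =
\begin{cases}
u + v + 2^w, & u + v < \mathrm{TMin}_w \quad \text{(NegOver)} \\
u + v, & \mathrm{TMin}_w \le u + v \le \mathrm{TMax}_w \\
u + v - 2^w, & \mathrm{TMax}_w < u + v \quad \text{(PosOver)}
\end{cases}
$$

[Figure: TAdd(u, v) over the u-v plane. Negative overflow occurs where u < 0 and v < 0; positive overflow occurs where u > 0 and v > 0; each overflow shifts the result by 2^w]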
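The three cases can be written out directly when w is small enough that the true sum fits in an ordinary int. The following is a sketch for w = 8; the enum and function names are assumptions made for the example.

```c
#include <stdint.h>

enum tadd_case { NEG_OVER = -1, NO_OVER = 0, POS_OVER = 1 };

/* Classify u + v for w = 8 using the true (mathematical) sum. */
enum tadd_case tadd8_classify(int8_t u, int8_t v) {
    int true_sum = (int)u + (int)v;          /* fits easily in an int */
    if (true_sum < INT8_MIN) return NEG_OVER;
    if (true_sum > INT8_MAX) return POS_OVER;
    return NO_OVER;
}

/* TAdd_8 by cases: add 2^8 on negative overflow, subtract it on positive. */
int8_t tadd8(int8_t u, int8_t v) {
    int true_sum = (int)u + (int)v;
    switch (tadd8_classify(u, v)) {
    case NEG_OVER: return (int8_t)(true_sum + 256);
    case POS_OVER: return (int8_t)(true_sum - 256);
    default:       return (int8_t)true_sum;
    }
}
```

For example, 127 + 1 overflows positively and yields -128, while -128 + (-1) overflows negatively and yields 127.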

Page 12: CS241 Computer Organization Spring 2014

Arithmetic: Basic Rules

⬛ Addition:
▪ Unsigned/signed: Normal addition followed by truncate, same operation on bit level
▪ Unsigned: addition mod 2^w
  ▪ Mathematical addition + possible subtraction of 2^w
▪ Signed: modified addition mod 2^w (result in proper range)
  ▪ Mathematical addition + possible addition or subtraction of 2^w

⬛ Multiplication:
▪ Unsigned/signed: Normal multiplication followed by truncate, same operation on bit level
▪ Unsigned: multiplication mod 2^w
▪ Signed: modified multiplication mod 2^w (result in proper range)
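The multiplication rule can be sketched the same way as addition: compute the full product, truncate to w bits, then choose how to read the bits. The code assumes w = 8, and `umult8` and `tmult8` are hypothetical names.

```c
#include <stdint.h>

/* UMult_8: multiply, then keep only the low 8 bits of the product. */
uint8_t umult8(uint8_t u, uint8_t v) {
    unsigned product = (unsigned)u * (unsigned)v;   /* up to 16 bits */
    return (uint8_t)(product & 0xFFu);
}

/* TMult_8: same low 8 bits, read back as a two's-complement value. */
int tmult8(int8_t u, int8_t v) {
    uint8_t bits = umult8((uint8_t)u, (uint8_t)v);
    return bits < 128 ? bits : bits - 256;
}
```

Here `tmult8(-3, 5)` gives -15 even though the multiplication itself was done on the unsigned patterns 253 and 5.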

Page 13: CS241 Computer Organization Spring 2014

Why Should I Use Unsigned?

⬛ Don’t Use Just Because Number Nonnegative
▪ Easy to make mistakes
    unsigned i;
    for (i = cnt-2; i >= 0; i--)
        a[i] += a[i+1];
▪ Can be very subtle
    #define DELTA sizeof(int)
    int i;
    for (i = CNT; i-DELTA >= 0; i-= DELTA)
        . . .

⬛ Do Use When Performing Modular Arithmetic
▪ Multiprecision arithmetic

⬛ Do Use When Using Bits to Represent Sets
▪ Logical right shift, no sign extension
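The first loop above never terminates as intended because `i >= 0` is always true for an unsigned `i`. One possible repair, shown here as a sketch rather than the course's official fix, keeps the unsigned index but tests before decrementing. The function name `suffix_sums` is an assumption.

```c
#include <stddef.h>

/* Suffix sums in place: a[i] += a[i+1] for i = cnt-2 down to 0.
   The test i-- > 0 compares first and decrements second, so the
   loop stops cleanly instead of wrapping around below zero. */
void suffix_sums(int *a, size_t cnt) {
    if (cnt < 2) return;                 /* nothing to accumulate */
    for (size_t i = cnt - 1; i-- > 0; )
        a[i] += a[i + 1];
}
```

Using a signed index would also work; the point is that the termination test must not rely on an unsigned value becoming negative.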

Page 14: CS241 Computer Organization Spring 2014

Integers & Floats

⬛ 5.0 is not 5
⬛ 1.0 is not 1
⬛ 0.0 is ? 0
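To make the comparison concrete, the sketch below exposes the raw bits of a float so they can be set beside the integer with the same value. It assumes `float` is a 32-bit IEEE 754 single, which is typical but not guaranteed by C; `float_bits` is a made-up helper name.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit pattern of a float.
   Assumes float is a 32-bit IEEE 754 single-precision value. */
uint32_t float_bits(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* copy bytes, no numeric conversion */
    return bits;
}
```

Under that assumption, `float_bits(5.0f)` is 0x40A00000 while the int 5 is 0x00000005, and `float_bits(0.0f)` is 0, the same pattern as the int 0.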

Page 15: CS241 Computer Organization Spring 2014

Floating Point

⬛ Background: Fractional binary numbers
⬛ IEEE floating point standard: Definition
⬛ Example and properties
⬛ Rounding, addition, multiplication
⬛ Floating point in C
⬛ Summary

Page 16: CS241 Computer Organization Spring 2014

Fractional binary numbers

⬛ What is 1011.101?

Page 17: CS241 Computer Organization Spring 2014

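Fractional Binary Numbers

[Figure: bit positions b_i b_(i–1) • • • b_2 b_1 b_0 . b_(–1) b_(–2) b_(–3) • • • b_(–j), with weights 2^i, 2^(i–1), ..., 4, 2, 1 to the left of the binary point and 1/2, 1/4, 1/8, ..., 2^(–j) to the right]

⬛ Representation
▪ Bits to right of “binary point” represent fractional powers of 2
▪ Represents rational number:

$$
\sum_{k=-j}^{i} b_k \cdot 2^k
$$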

Page 18: CS241 Computer Organization Spring 2014

Fractional Binary Numbers: Examples

⬛ Value      Representation
   5 3/4     101.11_2
   2 7/8     10.111_2
   63/64     0.111111_2

⬛ Observations
▪ Divide by 2 by shifting right
▪ Multiply by 2 by shifting left
▪ Numbers of form 0.111111…_2 are just below 1.0
  ▪ 1/2 + 1/4 + 1/8 + … + 1/2^i + … → 1.0
  ▪ Use notation 1.0 – ε
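The positional sum of b_k · 2^k can be evaluated mechanically. The sketch below does this for a string of binary digits; the function name `binary_to_double` and the input format (only '0', '1', and at most one '.') are assumptions for the example.

```c
#include <stddef.h>

/* Evaluate a binary string such as "1011.101" as the sum of b_k * 2^k. */
double binary_to_double(const char *s) {
    double value = 0.0;
    double weight = 0.5;          /* weight of the first fractional bit */
    int seen_point = 0;
    for (size_t n = 0; s[n] != '\0'; n++) {
        if (s[n] == '.') { seen_point = 1; continue; }
        int bit = s[n] - '0';     /* assumes only '0', '1', '.' appear */
        if (!seen_point) {
            value = value * 2.0 + bit;   /* shift left, add the new bit */
        } else {
            value += bit * weight;       /* add 1/2, 1/4, 1/8, ... */
            weight /= 2.0;
        }
    }
    return value;
}
```

Applied to "1011.101" this gives 8 + 2 + 1 + 1/2 + 1/8 = 11.625, and "101.11" gives 5.75.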

Page 19: CS241 Computer Organization Spring 2014

Representable Numbers

⬛ Limitation
▪ Can only exactly represent numbers of the form x/2^k
▪ Other rational numbers have repeating bit representations

⬛ Value      Representation
   1/3       0.0101010101[01]…_2
   1/5       0.001100110011[0011]…_2
   1/10      0.0001100110011[0011]…_2
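The repeating patterns can be reproduced by long division in base 2: double the numerator, emit a 1 whenever it reaches the denominator, and keep the remainder. This is a sketch with a hypothetical function name, and it assumes 0 <= num < den and that `out` has room for n + 1 characters.

```c
/* Write the first n fractional bits of num/den into out as '0'/'1'.
   Assumes 0 <= num < den and that out holds at least n + 1 chars. */
void fraction_bits(unsigned num, unsigned den, int n, char *out) {
    for (int k = 0; k < n; k++) {
        num *= 2;                     /* shift left by one binary place */
        if (num >= den) {
            out[k] = '1';
            num -= den;               /* keep the remainder */
        } else {
            out[k] = '0';
        }
    }
    out[n] = '\0';
}
```

For 1/10 the first 13 bits come out as 0001100110011, and the remainder keeps cycling, so the expansion never terminates.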

Page 20: CS241 Computer Organization Spring 2014

Floating Point

⬛ Background: Fractional binary numbers
⬛ IEEE floating point standard: Definition
⬛ Example and properties
⬛ Rounding, addition, multiplication
⬛ Floating point in C
⬛ Summary

Page 21: CS241 Computer Organization Spring 2014

IEEE Floating Point

⬛ IEEE Standard 754
▪ Established in 1985 as uniform standard for floating point arithmetic
  ▪ Before that, many idiosyncratic formats
▪ Supported by all major CPUs

⬛ Driven by numerical concerns
▪ Nice standards for rounding, overflow, underflow
▪ Hard to make fast in hardware
  ▪ Numerical analysts predominated over hardware designers in defining standard

Page 22: CS241 Computer Organization Spring 2014

Floating Point Representation

⬛ Numerical Form: (–1)^s M 2^E
▪ Sign bit s determines whether number is negative or positive
▪ Significand M normally a fractional value in range [1.0,2.0)
▪ Exponent E weights value by power of two

⬛ Encoding
▪ MSB s is sign bit s
▪ exp field encodes E (but is not equal to E)
▪ frac field encodes M (but is not equal to M)

[Field layout: s | exp | frac]

Page 23: CS241 Computer Organization Spring 2014

Float: consider k-bit exponent, k = 4

exp     E
0000    -6
0001    -6
0010    -5
0011    -4
0100    -3
0101    -2
0110    -1
0111    0
1000    1
1001    2
1010    3
1011    4
1100    5
1101    6
1110    7
1111    n/a

special cases:
exp = 0000 => denormalized number
exp = 1111 => either ±∞ or NaN
all 14 other exps => normalized number

bias = 2^(k-1) – 1 = 2^3 – 1 = 7

(1) if normalized number: E = exp - bias
    exp to E: subtract the bias
    E to exp: add the bias

(2) if denormalized number: E = -bias + 1
    exp to E: subtract bias and add 1
    E to exp: add bias and subtract 1
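The exp-to-E rules can be captured in a few lines. The sketch below takes the field width k as a parameter so the same code covers the k = 4 table and wider formats; the function name and its return convention are assumptions for this example.

```c
/* Map a k-bit exp field to the exponent E.
   Returns 1 and stores E on success; returns 0 and leaves *E alone
   when exp is all ones, which encodes infinity or NaN. */
int exp_to_E(unsigned exp, int k, int *E) {
    int bias = (1 << (k - 1)) - 1;
    unsigned all_ones = (1u << k) - 1;
    if (exp == all_ones) return 0;       /* special case: no exponent */
    *E = (exp == 0) ? 1 - bias           /* denormalized */
                    : (int)exp - bias;   /* normalized */
    return 1;
}
```

With k = 4 this gives E = -6 for both exp = 0000 and exp = 0001, and E = 7 for exp = 1110.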

Page 24: CS241 Computer Organization Spring 2014

Precisions

⬛ Single precision: 32 bits
   s (1 bit) | exp (8 bits) | frac (23 bits)

⬛ Double precision: 64 bits
   s (1 bit) | exp (11 bits) | frac (52 bits)

⬛ Extended precision: 80 bits (Intel only)
   s (1 bit) | exp (15 bits) | frac (63 or 64 bits)
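For the single-precision layout, the three fields can be pulled apart with shifts and masks. This is a sketch that assumes `float` is a 32-bit IEEE 754 single, which C does not guarantee; the struct and function names are hypothetical.

```c
#include <stdint.h>
#include <string.h>

struct float_fields {
    unsigned s;       /* 1-bit sign */
    unsigned exp;     /* 8-bit exponent field */
    uint32_t frac;    /* 23-bit fraction field */
};

/* Split a float into its s, exp, and frac fields.
   Assumes float is a 32-bit IEEE 754 single-precision value. */
struct float_fields split_float(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    struct float_fields out;
    out.s    = bits >> 31;
    out.exp  = (bits >> 23) & 0xFFu;
    out.frac = bits & 0x7FFFFFu;
    return out;
}
```

Under that assumption, 5.0f splits into s = 0, exp = 129, and frac = 0x200000, which matches 1.01_2 × 2^2 with a bias of 127.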

Page 25: CS241 Computer Organization Spring 2014

Visualization: Floating Point Encodings

[Figure: number line of floating point encodings, left to right: NaN, −∞, −Normalized, −Denorm, −0, +0, +Denorm, +Normalized, +∞, NaN]