Floating Point Arithmetic

Floating-Point vs. Fixed-Point Numbers

• Fixed point has limitations

x = 0000 0000. 0000 1001₂

y = 1001 0000. 0000 0000₂

Rounding?

Overflow? (x² and y² under/overflow)

• Floating point: represent numbers in two fixed -width
fields: “magnitude” and “exponent”

Magnitude: more bits = more accuracy

Exponent: more bits = wider range of numbers

	±	Exponent	Magnitude
X =

Floating Point Number Representation

• Sign field:

When 0: positive number, when 1, negative

• Exponent:

Usually presented as unsigned by adding an offset

Example: 4 bits of exponent, offset=8

• Magnitude (also called significand, mantissa)

Shift the number to get: 1.xxxx

Magnitude is the fractional part (hidden ‘1’)

Example: 6 bits of mantissa
o Number=110.0101

shift: 1.100101

mantissa=100101
o Number=0.0001011

shift: 1.011

mantissa=011000

Floating Point Numbers: Example

	±	Exponent	Magnitude

Floating Point Number Range

• Range: [-max, -min] U [min, max]
Min = smallest magnitude
Max = largest magnitude

• What happens if:
We increase # bits for exponent?
Increase # bits for magnitude?

Floating Point Operations

• Addition/subtraction, multiplication/division,
function evaluations, ...

• Basic operations
Adding exponents / magnitudes
Multiplying magnitudes
Aligning magnitudes (shifting, adjusting the
exponent)
Rounding
Checking for overflow/underflow
Normalization (shifting, adjusting the exponent)

Floating Point Addition

• More difficult than multiplication!

• Operations:
Align magnitudes (so that exponents are equal )
Add (and round)
Normalize (result in the form of 1.xxx)

Floating Point Adder Architecture

Floating Point Adder Components

• Unpacking
Inserting the “hidden 1”
Checking for special inputs (NaN, zero)

• Exponent difference
Used in aligning the magnitudes
A few bits enough for subtraction
o If 32-bit magnitude adder, 8 bits of exponent, only 5 bits
involved in subtraction
If negative difference , swap, use positive diff
o How to compute the positive diff?

• Pre-shifting and swap
Shift/complement provided for one operand only
Swap if needed

• Rounding
Three extra bits used for rounding

• Post-shifting
Result in the range (-4, 4)
Right shift: 1 bit max
o If right shift
Left shift: up to # of bits in magnitude
o Determine # of consecutive 0’s (1’s) in z, beginning with z₁.
Adjust exponent accordingly

• Packing
Check for special results (zero, under-/overflow)
Remove the hidden 1

Counting vs. Predicting Leading Zeros/Ones


Counting: Simpler but on the critical path	Predicting: More complex architecture

Floating Point Multiplication

• Simpler than floating-point addition
• Operation:
Inputs:
Output =
Sign: XOR
Exponent:
o Tentatively computed as e1+e2
o Subtract the bias (=127) HOW?
o Adjusted after normalization
Magnitude
o Result in the range [1,4) (inputs in the range [1,2) )
o Normalization: 1- or 2-bit shift right, depending on rounding
o Result is 2.(1+m) bits, should be rounded to (1+m) bits
o Rounding can gradually discard bits, instead of one last stage

Floating Point Multiplier Architecture

Note:
Pipelining is
used in
magnitude
multiplier, as
well as block
boundaries

Square-Rooting

• Most important elementary function
• In IEEE standard, specified a basic operation
(alongside +,-,*,/)
• Very similar to division
• Pencil-and-paper method:

Radicand:
Square root :
Remainder

Square Rooting: Example

• Example: sqrt(9 52 41)

• Why double the partial root?
Partial root after step 2 is:
Appending the next digit
Square of which is 1
The term already subtracted
Find q₀ such that is the
max number ≤ partial remainder

• The binary case:
Square of is:

Find q0 such that is ≤ partial
remainder
For the expression becomes (i.e.,
append “01” to the partial root)

Square Rooting: Example Base 2

• Example: sqrt(011101102) = sqrt(118)

Sequential Shift/Subtract Square Rooter Architecture

Other Methods for Square Rooting

• Restoring vs. non-restoring
We looked at the restoring algorithm
(after subtraction, restore partial remainder if the
result is negative)
Non-restoring:
Use a different encoding (use digits {-1,1} instead of
{0,1}) to avoid restoring

• High-radix
Similar to modified Booth encoding multiplication: take
care of more number of bits at a time
More complex circuit , but faster

• Convergence methods
Use the Newton method to approximate the function
approximates
OR
approximates ,
multiply by z to get
Iteratively improve the accuracy
Can use lookup table for the first iteration

Square Rooting: Abstract Notation

Floating point format:
- Shift left (not right)
- Powers of 2 decreasing

Restoring Floating-Point Square Root Calc.

Nonrestoring Floating-Point Square Root Calc.

If final S negative, drop the last ‘1’ in q, and restore the
remainder to the last positive value.

Square Root Through Convergence

• Newton-Rapson method:

Choose

• Example: compute square root of z=(2.4)₁₀

read out from table = 1.5	accurate to
	accurate to
	accurate to
	accurate to

Non-Restoring Parallel Square Rooter

Function Evaluation

• We looked at square root calculation
Direct hardware implementation (binary, BSD, high-radix)
o Serial
o Parallel
Approximation (Newton method)

• What about other functions?
Direct implementation
o Example: can be directly implemented in hardware
(using square root as a sub-component)
Polynomial approximation
Table look-up
o Either as part of calculation or for the full calculation

Table Lookup


Direct table-lookup implementation	Table-lookup with pre-and post-processing

Linear Interpolation Using Four Subinterval

Piecewise Table Lookup

Accuracy vs. Lookup Table Size Trade-off