Representation of Numbers

Working with floating point numbers, I came across some weird results.
Here's a sample program.

This is because of the way these numbers are actually represented in memory.
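As a minimal Python sketch of the kind of surprise meant here (an illustration using the familiar 0.1 + 0.2 case, not necessarily the program referred to above):

```python
# Decimal fractions such as 0.1 have no exact binary representation,
# so small rounding errors show up in ordinary arithmetic.
total = 0.1 + 0.2

print(total == 0.3)     # False
print(f"{total:.17f}")  # 0.30000000000000004
print(f"{0.3:.17f}")    # 0.29999999999999999
```

Neither 0.1, 0.2 nor 0.3 is stored exactly, and the rounding errors do not cancel out.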

Here is a brief explanation of how binary integers and floating point numbers are represented in memory.

Signed Binary Integers

We don't use a minus sign to represent negative numbers, because we would like to represent binary numbers with only two symbols, 0 and 1. There are a few ways to represent negative binary numbers. The simplest of these is called one's complement, where the sign of a binary number is changed by toggling each bit (0s become 1s and vice versa). This has some difficulties, among them the fact that zero can be represented in two different ways (for an eight-bit number these would be 0000 0000 and 1111 1111).
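The double-zero problem can be seen in a small Python sketch. The helper names here are my own, and the 8-bit width is just the size used in the text:

```python
def ones_complement_negate8(value):
    """Negate an 8-bit pattern in one's complement: flip every bit."""
    return ~value & 0xFF  # the mask keeps only the low 8 bits

def twos_complement_negate8(value):
    """Negate an 8-bit pattern in two's complement: flip every bit, add 1."""
    return (~value + 1) & 0xFF

print(f"{ones_complement_negate8(0):08b}")  # 11111111, a second pattern for zero
print(f"{twos_complement_negate8(0):08b}")  # 00000000, zero stays unique
```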

Hence, a method called two's complement notation, which avoids the pitfalls of one's complement, is used instead.

To represent an n-bit signed binary number, the leftmost bit has a special significance. The difference between a signed and an unsigned number is given in the table below for an 8-bit number.

The value of bits in signed and unsigned binary numbers
	Bit 7	Bit 6	Bit 5	Bit 4	Bit 3	Bit 2	Bit 1	Bit 0
Unsigned	2⁷= 128	2⁶= 64	2⁵= 32	2⁴= 16	2³= 8	2²= 4	2¹= 2	2⁰= 1
Signed	-(2⁷) = -128	2⁶= 64	2⁵= 32	2⁴= 16	2³= 8	2²= 4	2¹= 2	2⁰= 1

Let's look at how this changes the value of some binary numbers

Binary	Unsigned	Signed
0010 0011	35	35
1010 0011	163	-93
1111 1111	255	-1
1000 0000	128	-128
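The table can be reproduced with a short Python sketch. The function names are made up for this example; the only idea used is that Bit 7 weighs -128 instead of +128 in the signed reading:

```python
def as_unsigned(bits):
    """Interpret a string of 8 bits (spaces allowed) as an unsigned integer."""
    return int(bits.replace(" ", ""), 2)

def as_signed(bits):
    """Interpret the same 8 bits as a two's complement integer.

    Giving Bit 7 a weight of -128 instead of +128 is the same as
    subtracting 256 whenever that bit is set.
    """
    value = as_unsigned(bits)
    return value - 256 if value & 0x80 else value

for pattern in ("0010 0011", "1010 0011", "1111 1111", "1000 0000"):
    print(pattern, as_unsigned(pattern), as_signed(pattern))
```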

If Bit 7 is not set (as in the first example), the representation of signed and unsigned numbers is the same. However, when Bit 7 is set, the signed number is always negative. For this reason Bit 7 is sometimes called the sign bit. Signed numbers are added in the same way as unsigned numbers; the only difference is in how the result is interpreted. This means that numbers can be added regardless of whether or not they are signed.
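To make the last point concrete, here is a rough Python sketch (helper names are my own) that adds two 8-bit patterns once and then reads the result both ways:

```python
def add8(a, b):
    """Add two 8-bit patterns, discarding any carry out of Bit 7."""
    return (a + b) & 0xFF

def to_signed8(value):
    """Read an 8-bit pattern as a two's complement integer."""
    return value - 256 if value & 0x80 else value

a = 0b11110101  # 245 unsigned, -11 signed
b = 0b00000011  # 3 in both readings
result = add8(a, b)

print(f"{result:08b}")     # 11111000
print(result)              # 248  (245 + 3)
print(to_signed8(result))  # -8   (-11 + 3)
```

The same bit pattern comes out of the adder either way; only the reading of it changes.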

To form a negative two's complement number, take the corresponding positive number, invert all the bits, and add 1. The example below illustrates this by forming the number -11 as a two's complement integer:

11₁₀ = 0000 1011₂
invert -> 1111 0100₂
add 1 -> 1111 0101₂

So 1111 0101 is the two's complement representation of -11. We can check this by adding up the contributions from the individual bits:

1111 0101₂= -128 + 64 + 32 + 16 + 0 + 4 + 0 + 1 = -11.

The same procedure (invert and add 1) converts a negative number to its positive equivalent. If we want to know what number is represented by 1111 1101, we apply the procedure again:

? = 1111 1101₂
invert -> 0000 0010₂
add 1 -> 0000 0011₂

Since 0000 0011 represents the number 3, we know that 1111 1101 represents the number -3.
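Both worked examples can be checked with a small Python sketch of the invert-and-add-1 procedure. The name negate8 is just a label for this example:

```python
def negate8(value):
    """Two's complement negation of an 8-bit pattern: invert, then add 1."""
    inverted = ~value & 0xFF      # flip every bit, keep only 8 bits
    return (inverted + 1) & 0xFF  # add 1, discarding any carry out of Bit 7

print(f"{negate8(0b00001011):08b}")  # 11110101, the pattern for -11
print(f"{negate8(0b11111101):08b}")  # 00000011, so 1111 1101 is -3
```

Applying negate8 twice gives back the original pattern, which is why the same procedure works in both directions.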

Floating point numbers

IEEE floating point numbers have three basic components:

1. sign

2. exponent

3. mantissa

The mantissa is composed of the fraction and an implicit leading digit (explained below). The exponent base (2) is implicit and need not be stored.

The following table shows the layout for single (32-bit) and double (64-bit) precision floating-point values. The number of bits in each field is shown, with bit ranges in square brackets:

	Sign	Exponent	Fraction	Bias
Single Precision	1 [31]	8 [30-23]	23 [22-00]	127
Double Precision	1 [63]	11 [62-52]	52 [51-00]	1023
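The single-precision layout can be poked at from Python with the standard struct module. This is only a sketch, and float32_fields is a name invented for the example:

```python
import struct

def float32_fields(x):
    """Return (sign, stored exponent, fraction) of x as a single-precision float."""
    # Pack as a big-endian 32-bit float, then reread the same bytes as an integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # bit 31
    exponent = (bits >> 23) & 0xFF  # bits 30-23
    fraction = bits & 0x7FFFFF      # bits 22-0
    return sign, exponent, fraction

print(float32_fields(1.0))   # (0, 127, 0)
print(float32_fields(-2.5))  # (1, 128, 2097152)
```

For -2.5 = -1.25 × 2¹, the fraction field holds 0.25, which is 2²¹ = 2097152 when the 23 fraction bits are read as an integer.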

The Sign Bit

The sign bit is as simple as it gets. 0 denotes a positive number; 1 denotes a negative number. Flipping the value of this bit flips the sign of the number.

The Exponent

The exponent field needs to represent both positive and negative exponents. To do this, a bias is added to the actual exponent in order to get the stored exponent. For IEEE single-precision floats, this value is 127. Thus, an exponent of zero means that 127 is stored in the exponent field. A stored value of 200 indicates an exponent of (200-127), or 73. Exponents of -127 (all 0s) and +128 (all 1s) are reserved for special numbers.

For double precision, the exponent field is 11 bits, and has a bias of 1023.
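A quick way to see the two biases at work is to read the stored exponent of the same value in both formats. The sketch below uses Python's struct module; the function names are my own:

```python
import struct

def stored_exponent_single(x):
    """Stored (biased) exponent of x as a 32-bit float."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return (bits >> 23) & 0xFF

def stored_exponent_double(x):
    """Stored (biased) exponent of x as a 64-bit float."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    return (bits >> 52) & 0x7FF

x = 2.0 ** 73
print(stored_exponent_single(x))         # 200
print(stored_exponent_single(x) - 127)   # 73
print(stored_exponent_double(x) - 1023)  # 73
print(stored_exponent_single(1.0))       # 127
print(stored_exponent_double(1.0))       # 1023
```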

The Mantissa

The mantissa, also known as the significand, represents the precision bits of the number. It is composed of an implicit leading bit and the fraction bits.

To find out the value of the implicit leading bit, consider that any number can be expressed in scientific notation in many different ways. For example, the number five can be represented as any of these:

5.00 × 10⁰
0.05 × 10²
5000 × 10⁻³

In order to maximize the quantity of representable numbers, floating-point numbers are typically stored in normalized form. This basically puts the radix point after the first non-zero digit. In normalized form, five is represented as 5.0 × 10⁰.

A nice little optimization is available to us in base two, since the only possible non-zero digit is 1. Thus, we can just assume a leading digit of 1, and don't need to represent it explicitly. As a result, a single-precision mantissa effectively has 24 bits of resolution while storing only 23 fraction bits.

Putting it All Together

So, to sum up:

The sign bit is 0 for positive, 1 for negative.
The exponent's base is two.
The exponent field contains 127 plus the true exponent for single-precision, or 1023 plus the true exponent for double precision.
The first bit of the mantissa is typically assumed to be 1, so the mantissa becomes 1.f, where f is the field of fraction bits.
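Putting the summary into code, a normalized single-precision value can be rebuilt from its three fields as (-1)^s × 1.f × 2^(e-127). The following Python sketch does that; decode_float32 is a made-up name, and the special values (stored exponent 0 or 255) are deliberately not handled:

```python
import struct

def decode_float32(x):
    """Rebuild a normalized single-precision value from its three fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31
    stored_exp = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    # Only valid for normalized numbers (stored exponent 1 to 254).
    mantissa = 1 + fraction / 2**23  # the implicit leading 1 plus .f
    return (-1) ** sign * mantissa * 2.0 ** (stored_exp - 127)

print(decode_float32(-6.25))    # -6.25
print(decode_float32(0.15625))  # 0.15625
```

Both test values are exactly representable in single precision, so the round trip returns them unchanged.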

Ranges of Floating-Point Numbers

Let's consider single-precision floats for a second. Note that we're essentially taking a 32-bit number and re-jiggering the fields to cover a much broader range. Something has to give, and it's precision. For example, regular 32-bit integers can store every integer in their range exactly, with 32 bits of resolution. Single-precision floating-point, on the other hand, is unable to match this resolution with its 24 bits. It does, however, approximate the value by effectively truncating from the lower end. For example:

      11110000 11001100 10101010 00001111  // 32-bit integer
= +1.1110000 11001100 10101010 x 2³¹     // Single-Precision Float
=   11110000 11001100 10101010 00000000  // Corresponding Value

This approximates the 32-bit value, but doesn't yield an exact representation. On the other hand, besides the ability to represent fractional components (which integers lack completely), a single-precision float can represent numbers up to nearly 2¹²⁸, compared to a 32-bit integer's maximum value of around 2³².
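The example above can be replayed in Python by pushing the integer through a 32-bit float and back. This is a sketch with an invented helper name. One caveat: the conversion done by struct rounds to the nearest representable value rather than truncating, but for this particular bit pattern the two give the same result.

```python
import struct

def to_float32_and_back(n):
    """Round-trip an integer through a single-precision float."""
    (as_float,) = struct.unpack(">f", struct.pack(">f", float(n)))
    return int(as_float)

original = 0b11110000_11001100_10101010_00001111
stored = to_float32_and_back(original)

print(f"{original:032b}")  # 11110000110011001010101000001111
print(f"{stored:032b}")    # 11110000110011001010101000000000
print(original - stored)   # 15, the value of the eight lost low bits
```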

Special Values

IEEE reserves exponent field values of all 0s and all 1s to denote special values in the floating-point scheme.

Zero

Zero is not directly representable in the straight format, due to the assumption of a leading 1 (we'd need to specify a true zero mantissa to yield a value of zero). Zero is a special value denoted with an exponent field of zero and a fraction field of zero. Note that -0 and +0 are distinct values, though they both compare as equal.
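The two zeros are easy to tell apart at the bit level, even though they compare equal. A small Python sketch (float32_bits is my own helper):

```python
import math
import struct

def float32_bits(x):
    """Return the 32-bit single-precision pattern of x as a string of 0s and 1s."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    return f"{bits:032b}"

print(float32_bits(0.0))         # 00000000000000000000000000000000
print(float32_bits(-0.0))        # 10000000000000000000000000000000
print(0.0 == -0.0)               # True
print(math.copysign(1.0, -0.0))  # -1.0, the sign bit is still there
```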

Denormalized

If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (-1)^s × 0.f × 2^-126, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form (-1)^s × 0.f × 2^-1022. Seen this way, zero can be interpreted as a special type of denormalized number.
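To check the single-precision formula, one can build a few bit patterns by hand and see what they decode to. A Python sketch, with float32_from_bits as an invented helper:

```python
import struct

def float32_from_bits(bits):
    """Interpret a 32-bit integer as a single-precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

smallest_denormal = float32_from_bits(0x00000001)  # exponent 0, fraction 00...01
largest_denormal = float32_from_bits(0x007FFFFF)   # exponent 0, fraction 11...11
smallest_normal = float32_from_bits(0x00800000)    # exponent 1, fraction 0

print(smallest_denormal == 2.0 ** -149)    # True: 2^-23 × 2^-126
print(smallest_normal == 2.0 ** -126)      # True
print(largest_denormal < smallest_normal)  # True
```

The denormalized values fill the gap between zero and the smallest normalized number.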

Infinity

The values +infinity and -infinity are denoted with an exponent of all 1s and a fraction of all 0s. The sign bit distinguishes between negative infinity and positive infinity. Being able to denote infinity as a specific value is useful because it allows operations to continue past overflow situations. Operations with infinite values are well defined in IEEE floating point.
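A short Python sketch of the infinity encoding and a few of the well-defined operations. The helper name is mine, and the overflow example uses Python's native double-precision floats rather than single precision:

```python
import math
import struct

def float32_from_bits(bits):
    """Interpret a 32-bit integer as a single-precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

pos_inf = float32_from_bits(0x7F800000)  # sign 0, exponent all 1s, fraction 0
neg_inf = float32_from_bits(0xFF800000)  # sign 1, exponent all 1s, fraction 0

print(pos_inf, neg_inf)         # inf -inf
print(pos_inf + 1.0)            # inf
print(1.0 / pos_inf)            # 0.0
print(1e308 * 10 == math.inf)   # True: double-precision overflow yields infinity
```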

Not A Number

The value NaN (Not a Number) is used to represent a value that does not represent a real number. NaNs are represented by a bit pattern with an exponent of all 1s and a non-zero fraction. There are two categories of NaN: QNaN (Quiet NaN) and SNaN (Signalling NaN).

A QNaN is a NaN with the most significant fraction bit set. QNaNs propagate freely through most arithmetic operations. These values pop out of an operation when the result is not mathematically defined.

An SNaN is a NaN with the most significant fraction bit clear. It is used to signal an exception when used in operations. SNaNs can be handy to assign to uninitialized variables to trap premature usage.

Semantically, QNaNs denote indeterminate operations, while SNaNs denote invalid operations.
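Quiet NaN behaviour can be observed from Python with a small sketch like the one below. The helper name is invented, and signalling NaNs are left out because, as far as I know, plain Python gives no portable way to observe the exception they are meant to raise:

```python
import math
import struct

def float32_from_bits(bits):
    """Interpret a 32-bit integer as a single-precision float."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value

# Exponent all 1s, most significant fraction bit set: a quiet NaN.
quiet_nan = float32_from_bits(0x7FC00000)

print(math.isnan(quiet_nan))            # True
print(quiet_nan == quiet_nan)           # False: a NaN is not equal even to itself
print(math.isnan(quiet_nan + 1.0))      # True: it propagates through arithmetic
print(math.isnan(math.inf - math.inf))  # True: an indeterminate result
```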

References and Further Reading

http://en.wikipedia.org/wiki/Floating_point

http://en.allexperts.com/q/C-1587/float-storage.htm

http://steve.hollasch.net/cgindex/coding/ieeefloat.html

http://www.swarthmore.edu/NatSci/echeeve1/Ref/BinaryMath/NumSys.html#float