Why floating point numbers?

Most computer architecture books describe the IEEE floating point number format in detail, but few try to motivate how that format was arrived at. On this page we use a "floating point"-like format for integers to motivate why the IEEE floating point format, with its normalized and denormalized numbers and biased exponents, is structured the way it is.

Here is a problem that motivates some of what we will do. Suppose we have just 5 bits to represent nonnegative integers. We can represent at most 32 different integers using 5 bits. The usual way to represent integers using 5 bits is the 5 bit unsigned integer format, which encodes the integers from 0 to 31. But suppose we want to encode a wider range of numbers than 0 to 31? That is the problem that we want to explore: using 5 bits to represent a wider range of integers than the usual "fixed point" unsigned integer format allows (we will explain soon what we mean by "fixed point" vs. "floating point").

To get a sense of what we will do, consider the number 12,300,000,000,000. It takes 14 digits (not counting the "punctuation marks", the commas) to represent that number. But there is an alternative notation for such large numbers. We can write this number, using "scientific notation", as, for example, 123e11, which represents 123*10^(11). So by making use of an exponential notation we can write a 14 digit number using just 5 digits (not counting the "punctuation mark", the e). Notice that this 5 digit format can represent a number as large as 999e99, which has 102 digits when written in the usual decimal format. So five digits, used in an exponential format, can represent numbers of far more than 5 places in the usual decimal format.

Notice one thing though. Not all large integers can be represented by fewer digits when we use scientific notation. The number 12,356,789,876,543 has fourteen digits, and scientific notation is not going to make this number any easier to write. But if we are willing to approximate this number, then the number 124e11 can be considered a reasonable substitute for the original number. In fact, 124e11 will be the "reasonable" approximation for a lot of 14 digit integers (can you determine exactly how many?). We say that the original number 12,356,789,876,543 has 14 significant digits and its approximation 124e11 has only three significant digits.

Now let us create a binary encoding of positive integers that mimics "scientific notation" so that we can encode large integers in a small number of bits. In particular, we shall create a 5 bit encoding that encodes integers from a far larger range than the 0 to 31 range determined by the unsigned integer format.

Let the notation 101e11 be binary scientific notation for 5*2^(3)=40. The binary number before the e is called the significand and the binary number after the e is called the exponent. Let us use this notation to define an encoding of 5 bit words into positive integers. For any given 5 bit word, let the three most significant bits be the significand and the two least significant bits be the exponent. Below is a table showing how this encoding maps 5 bit words to integers. We write the 5 bit words with a comma in them to help distinguish the significand from the exponent. (Notice that we are using 3 bit and 2 bit unsigned integer formats in the significand and exponent, respectively.)

0 0 0, 0 0  ->  0 * 2^0 =  0
0 0 1, 0 0  ->  1 * 2^0 =  1
0 1 0, 0 0  ->  2 * 2^0 =  2
0 1 1, 0 0  ->  3 * 2^0 =  3
1 0 0, 0 0  ->  4 * 2^0 =  4
1 0 1, 0 0  ->  5 * 2^0 =  5
1 1 0, 0 0  ->  6 * 2^0 =  6
1 1 1, 0 0  ->  7 * 2^0 =  7

0 0 0, 0 1  ->  0 * 2^1 =  0
0 0 1, 0 1  ->  1 * 2^1 =  2
0 1 0, 0 1  ->  2 * 2^1 =  4
0 1 1, 0 1  ->  3 * 2^1 =  6
1 0 0, 0 1  ->  4 * 2^1 =  8
1 0 1, 0 1  ->  5 * 2^1 = 10
1 1 0, 0 1  ->  6 * 2^1 = 12
1 1 1, 0 1  ->  7 * 2^1 = 14

0 0 0, 1 0  ->  0 * 2^2 =  0
0 0 1, 1 0  ->  1 * 2^2 =  4
0 1 0, 1 0  ->  2 * 2^2 =  8
0 1 1, 1 0  ->  3 * 2^2 = 12
1 0 0, 1 0  ->  4 * 2^2 = 16
1 0 1, 1 0  ->  5 * 2^2 = 20
1 1 0, 1 0  ->  6 * 2^2 = 24
1 1 1, 1 0  ->  7 * 2^2 = 28

0 0 0, 1 1  ->  0 * 2^3 =  0
0 0 1, 1 1  ->  1 * 2^3 =  8
0 1 0, 1 1  ->  2 * 2^3 = 16
0 1 1, 1 1  ->  3 * 2^3 = 24
1 0 0, 1 1  ->  4 * 2^3 = 32
1 0 1, 1 1  ->  5 * 2^3 = 40
1 1 0, 1 1  ->  6 * 2^3 = 48
1 1 1, 1 1  ->  7 * 2^3 = 56

Notice that the following 20 integers have representations in this encoding.

0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40, 48, 56

On the one hand, we do have an encoding that covers a larger range of integers (0 to 56) than the range 0 to 31 given by the 5 bit unsigned integer format. But instead of representing 32 different integers with our 5 bits, we represent only 20.
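To make this concrete, here is a short Python sketch (my own throwaway code, not part of the page's construction) that enumerates the un-normalized encoding and collects its distinct values; it reproduces the table and the list of 20 integers above.

values = set()
for exp in range(4):          # 2 bit exponent field: 0..3
    for sig in range(8):      # 3 bit significand field: 0..7
        value = sig * 2**exp
        print(f"{sig:03b},{exp:02b}  ->  {sig} * 2^{exp} = {value}")
        values.add(value)

print(sorted(values))         # the 20 distinct integers listed above
print(len(values))            # 20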

We have 12 wasted encodings since there are
4 representations of 0,
2 representations of 2,
3 representations of 4,
2 representations of 6,
3 representations of 8,
2 representations of 12,
2 representations of 16,
2 representations of 24.

We can get rid of the redundant encodings by using the exponents to normalize the significands. The idea is that every non-zero significand has a 1 in it, so we shift the significand to the left until there is a 1 in the leftmost place (i.e., the "most significant bit" (MSB) of the significand), and every time we shift the significand to the left, we subtract one from the exponent. For example, the number 011,10 is normalized to 110,01. But notice that since every normalized significand has a 1 in the MSB, we can drop the MSB from the encoding and not store it (since we know that it is there). But instead of using just two bits in the encoded significand, we will still use three bits, so the actual (normalized) significand will be 4 bits. So the un-normalized number 011,10 is normalized to 100,00 (we shift the significand twice to get the leading 1 into the fourth bit, which we drop from the encoding, and we subtract 2 from the exponent to get 0).

Let us see how this normalization gets rid of the redundant encodings. In the un-normalized encoding above, the integer 8 has three representations. The first one, 100,01, normalizes with one left shift to 000,00. The second, 010,10, normalizes with two left shifts to 000,00. And the third, 001,11, normalizes with three left shifts to 000,00. So all three un-normalized representations of the integer 8 normalize to the same encoding, 000,00.
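Here is a minimal Python sketch of this normalization step, under the conventions just described (3 stored significand bits, a hidden leading 1, a 2 bit exponent); the function name is my own.

def normalize(sig, exp):
    # sig is an un-normalized 3 bit significand (1..7), exp a 2 bit exponent.
    # Returns (stored_sig, exp), where stored_sig is the low 3 bits of the
    # 4 bit normalized significand (the hidden leading 1 is dropped).
    if sig == 0:
        raise ValueError("0 has no normalized encoding")
    while sig < 0b1000:       # shift left until the 1 reaches the fourth bit
        sig <<= 1
        exp -= 1
    if exp < 0:
        raise ValueError("too small to normalize without negative exponents")
    return sig & 0b111, exp   # drop the hidden leading 1

# The example from the text: 011,10 normalizes to 100,00.
print(normalize(0b011, 0b10))   # (4, 0), i.e. 100,00
# All three representations of 8 normalize to 000,00.
print(normalize(0b100, 0b01), normalize(0b010, 0b10), normalize(0b001, 0b11))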

Here are all the encodings using normalized significands.

0 0 0, 0 0  ->   8 * 2^0 =  8
0 0 1, 0 0  ->   9 * 2^0 =  9
0 1 0, 0 0  ->  10 * 2^0 = 10
0 1 1, 0 0  ->  11 * 2^0 = 11
1 0 0, 0 0  ->  12 * 2^0 = 12
1 0 1, 0 0  ->  13 * 2^0 = 13
1 1 0, 0 0  ->  14 * 2^0 = 14
1 1 1, 0 0  ->  15 * 2^0 = 15

0 0 0, 0 1  ->   8 * 2^1 = 16
0 0 1, 0 1  ->   9 * 2^1 = 18
0 1 0, 0 1  ->  10 * 2^1 = 20
0 1 1, 0 1  ->  11 * 2^1 = 22
1 0 0, 0 1  ->  12 * 2^1 = 24
1 0 1, 0 1  ->  13 * 2^1 = 26
1 1 0, 0 1  ->  14 * 2^1 = 28
1 1 1, 0 1  ->  15 * 2^1 = 30

0 0 0, 1 0  ->   8 * 2^2 = 32
0 0 1, 1 0  ->   9 * 2^2 = 36
0 1 0, 1 0  ->  10 * 2^2 = 40
0 1 1, 1 0  ->  11 * 2^2 = 44
1 0 0, 1 0  ->  12 * 2^2 = 48
1 0 1, 1 0  ->  13 * 2^2 = 52
1 1 0, 1 0  ->  14 * 2^2 = 56
1 1 1, 1 0  ->  15 * 2^2 = 60

0 0 0, 1 1  ->   8 * 2^3 =  64
0 0 1, 1 1  ->   9 * 2^3 =  72
0 1 0, 1 1  ->  10 * 2^3 =  80
0 1 1, 1 1  ->  11 * 2^3 =  88
1 0 0, 1 1  ->  12 * 2^3 =  96
1 0 1, 1 1  ->  13 * 2^3 = 104
1 1 0, 1 1  ->  14 * 2^3 = 112
1 1 1, 1 1  ->  15 * 2^3 = 120

Notice that the following 32 integers have representations in this encoding.

8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 20, 22, 24, 26, 28, 30,
32, 36, 40, 44, 48, 52, 56, 60, 64, 72, 80, 88, 96, 104, 112, 120.

Normalizing the significands did two things. First, it eliminated the redundant encodings, so we get 32 distinct encodings out of our 5 bits. And second, since every normalized significand has a (hidden) fourth bit, we can encode a larger range of integers, since we have a larger significand to work with. The previous encoding encoded 20 integers out of the range from 0 to 56 (approximately 35% of the integers in the range) and the normalized encoding encodes 32 integers out of the range from 8 to 120 (approximately 28% of the integers in the range). This shows that normalization does a very efficient job of increasing the range without significantly increasing the amount of rounding needed when we approximate an integer that has no encoding by one that does.
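A quick sketch to double-check those counts and percentages:

# Distinct values and range coverage for the two encodings.
unnormalized = {sig * 2**exp for exp in range(4) for sig in range(8)}
normalized   = {(8 + sig) * 2**exp for exp in range(4) for sig in range(8)}

print(len(unnormalized), max(unnormalized))                # 20 values, max 56
print(len(unnormalized) / 57)                              # ~0.35 of 0..56
print(len(normalized), min(normalized), max(normalized))   # 32 values, 8..120
print(len(normalized) / 113)                               # ~0.28 of 8..120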

Exercise: Work out the details of the encoding that uses 4 bits for the significand and 2 bits for the exponent and is not normalized. How many distinct integers have encodings? What is the range of the encoding? How does 4 explicit bits in the significand compare with 3 explicit bits and 1 implicit, normalized bit?

Normalization solved one problem for us, but it created another. We have eliminated the redundant encodings, but the price that we pay is that 0 no longer has an encoding! Also, since we are not allowing negative exponents, we could not normalize any numbers from the previous encoding that were less than 8. We fix both of these problems by denormalizing some of our significands, namely those that have 0 as an exponent.

Here are our encodings, with the zero-exponent significands left denormalized.

0 0 0, 0 0  ->   0 * 2^0 =  0   // denormalized
0 0 1, 0 0  ->   1 * 2^0 =  1
0 1 0, 0 0  ->   2 * 2^0 =  2
0 1 1, 0 0  ->   3 * 2^0 =  3
1 0 0, 0 0  ->   4 * 2^0 =  4
1 0 1, 0 0  ->   5 * 2^0 =  5
1 1 0, 0 0  ->   6 * 2^0 =  6
1 1 1, 0 0  ->   7 * 2^0 =  7

0 0 0, 0 1  ->   8 * 2^1 = 16   // normalized from here down
0 0 1, 0 1  ->   9 * 2^1 = 18
0 1 0, 0 1  ->  10 * 2^1 = 20
0 1 1, 0 1  ->  11 * 2^1 = 22
1 0 0, 0 1  ->  12 * 2^1 = 24
1 0 1, 0 1  ->  13 * 2^1 = 26
1 1 0, 0 1  ->  14 * 2^1 = 28
1 1 1, 0 1  ->  15 * 2^1 = 30

0 0 0, 1 0  ->   8 * 2^2 = 32
0 0 1, 1 0  ->   9 * 2^2 = 36
0 1 0, 1 0  ->  10 * 2^2 = 40
0 1 1, 1 0  ->  11 * 2^2 = 44
1 0 0, 1 0  ->  12 * 2^2 = 48
1 0 1, 1 0  ->  13 * 2^2 = 52
1 1 0, 1 0  ->  14 * 2^2 = 56
1 1 1, 1 0  ->  15 * 2^2 = 60

0 0 0, 1 1  ->   8 * 2^3 =  64
0 0 1, 1 1  ->   9 * 2^3 =  72
0 1 0, 1 1  ->  10 * 2^3 =  80
0 1 1, 1 1  ->  11 * 2^3 =  88
1 0 0, 1 1  ->  12 * 2^3 =  96
1 0 1, 1 1  ->  13 * 2^3 = 104
1 1 0, 1 1  ->  14 * 2^3 = 112
1 1 1, 1 1  ->  15 * 2^3 = 120

Notice that the following 32 integers have representations in this encoding.

0, 1, 2, 3, 4, 5, 6, 7, 16, 18, 20, 22, 24, 26, 28, 30,
32, 36, 40, 44, 48, 52, 56, 60, 64, 72, 80, 88, 96, 104, 112, 120.

Notice the big gap between the denormalized and normalized numbers (from 7 up to 16). We can get rid of it with the following trick. The normalized numbers effectively have 4 bit significands: the MSB of each significand is always 1, and since it is always 1, it is not shown in the encoding. Let us also give the denormalized numbers 4 bit significands. We will say that every denormalized significand has 4 bits and its LSB is always 0 (and since the LSB is always 0, it is not shown in the encoding). Here is what this new encoding looks like.

0 0 0, 0 0  ->   0 * 2^0 =  0   // denormalized, with an implied 0 as
0 0 1, 0 0  ->   2 * 2^0 =  2   // the LSB of the significand
0 1 0, 0 0  ->   4 * 2^0 =  4
0 1 1, 0 0  ->   6 * 2^0 =  6
1 0 0, 0 0  ->   8 * 2^0 =  8
1 0 1, 0 0  ->  10 * 2^0 = 10
1 1 0, 0 0  ->  12 * 2^0 = 12
1 1 1, 0 0  ->  14 * 2^0 = 14

0 0 0, 0 1  ->   8 * 2^1 = 16   // normalized, with an implied 1 as
0 0 1, 0 1  ->   9 * 2^1 = 18   // the MSB of the significand
0 1 0, 0 1  ->  10 * 2^1 = 20
0 1 1, 0 1  ->  11 * 2^1 = 22
1 0 0, 0 1  ->  12 * 2^1 = 24
1 0 1, 0 1  ->  13 * 2^1 = 26
1 1 0, 0 1  ->  14 * 2^1 = 28
1 1 1, 0 1  ->  15 * 2^1 = 30

0 0 0, 1 0  ->   8 * 2^2 = 32
0 0 1, 1 0  ->   9 * 2^2 = 36
0 1 0, 1 0  ->  10 * 2^2 = 40
0 1 1, 1 0  ->  11 * 2^2 = 44
1 0 0, 1 0  ->  12 * 2^2 = 48
1 0 1, 1 0  ->  13 * 2^2 = 52
1 1 0, 1 0  ->  14 * 2^2 = 56
1 1 1, 1 0  ->  15 * 2^2 = 60

0 0 0, 1 1  ->   8 * 2^3 =  64
0 0 1, 1 1  ->   9 * 2^3 =  72
0 1 0, 1 1  ->  10 * 2^3 =  80
0 1 1, 1 1  ->  11 * 2^3 =  88
1 0 0, 1 1  ->  12 * 2^3 =  96
1 0 1, 1 1  ->  13 * 2^3 = 104
1 1 0, 1 1  ->  14 * 2^3 = 112
1 1 1, 1 1  ->  15 * 2^3 = 120

The gap between the denormalized and normalized numbers is gone. We now have encodings for every other integer between 0 and 32, every fourth integer between 32 and 64, and every eighth integer between 64 and 120. So we now have a very smoothly changing encoding where the "smaller" integers are closer together and the "larger" integers are further apart.
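Here is a sketch of a decoder for this final integer encoding, with the implied trailing 0 for denormalized words and the implied leading 1 for normalized ones (the function name is my own):

def decode_int(word):
    # word is a 5 bit code in the sss,ee layout used in the tables above.
    sig = (word >> 2) & 0b111   # top 3 bits: stored significand
    exp = word & 0b11           # bottom 2 bits: exponent
    if exp == 0:
        return sig << 1                  # denormalized: implied 0 in the LSB
    return (0b1000 | sig) * 2**exp       # normalized: implied 1 in the MSB

print(sorted(decode_int(w) for w in range(32)))
# 0, 2, 4, ..., 14, 16, 18, ..., 30, 32, 36, ..., 60, 64, 72, ..., 120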

In the end, we have managed to get 5 bits to represent 32 integers over a much larger range, 0 to 120, than what we would get if we used the more traditional 5-bit unsigned integer representation, which has a range from 0 to 31.

The next question we could ask is, can we come up with algorithms for adding, subtracting, multiplying and dividing integers using this normalized, "floating point" encoding scheme? Notice that these arithmetic algorithms would need to include rounding rules, since, for example, 20 + 14 = 34, but 34 does not have an encoding, so 34 would need to be "rounded" to either 32 or 36.
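As a small taste of what such rounding rules involve, here is a sketch of round-to-nearest over the 32 encodable integers (ties are broken toward the smaller value here; a real rounding mode would be more careful):

# The 32 integers encodable in the scheme above.
ENCODABLE = sorted({(sig << 1) if exp == 0 else (8 | sig) * 2**exp
                    for exp in range(4) for sig in range(8)})

def round_to_encodable(n):
    # Nearest encodable integer to n; on a tie, min() keeps the smaller one.
    return min(ENCODABLE, key=lambda v: abs(v - n))

print(round_to_encodable(20 + 14))   # 34 rounds to 32 (36 is equally close)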

Instead of working out the arithmetic algorithms and rounding rules for our integer encoding, let us see how we can get our encoding to be more like the IEEE floating point format. That is, let us see how we can encode some fractional numbers along with our integers. We can include fractional numbers in our encodings in two ways: by changing where we put the "binary point" in the significand, or by allowing negative numbers in the exponent.

First, let us try negative numbers in the exponent. Certainly negative exponents can lead to fractions (but a negative exponent does not always mean a fraction; for example 8*2^(-2) has a negative exponent but it is an integer). So if we allow negative exponents in our encodings, we can represent some fractional numbers.

Our next encoding uses unsigned integers in the significands (without normalization) and a biased number format in the exponent. The value of the bias is 1 (which is 2^(n-1)-1, where n=2 is the number of bits in the exponent). So the (biased) bit patterns 00, 01, 10, 11 in the exponent field represent (unbiased) exponents of -1, 0, 1, 2, respectively.
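In code, decoding a biased exponent field is just a subtraction; here is a one-line sketch (the helper name is my own, and the bias value matches the one chosen above):

BIAS = 1                     # 2^(n-1) - 1 with n = 2 exponent bits

def unbiased(exp_bits):
    # Map the stored 2 bit exponent pattern to the exponent it represents.
    return exp_bits - BIAS

print([unbiased(e) for e in range(4)])   # [-1, 0, 1, 2]

With that decoding in mind, here is the encoding.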

0 0 0, 0 0  ->  0 * 2^(-1) = 0
0 0 1, 0 0  ->  1 * 2^(-1) = 1/2
0 1 0, 0 0  ->  2 * 2^(-1) = 1
0 1 1, 0 0  ->  3 * 2^(-1) = 3/2
1 0 0, 0 0  ->  4 * 2^(-1) = 2
1 0 1, 0 0  ->  5 * 2^(-1) = 5/2
1 1 0, 0 0  ->  6 * 2^(-1) = 3
1 1 1, 0 0  ->  7 * 2^(-1) = 7/2

0 0 0, 0 1  ->  0 * 2^0 =  0
0 0 1, 0 1  ->  1 * 2^0 =  1
0 1 0, 0 1  ->  2 * 2^0 =  2
0 1 1, 0 1  ->  3 * 2^0 =  3
1 0 0, 0 1  ->  4 * 2^0 =  4
1 0 1, 0 1  ->  5 * 2^0 =  5
1 1 0, 0 1  ->  6 * 2^0 =  6
1 1 1, 0 1  ->  7 * 2^0 =  7

0 0 0, 1 0  ->  0 * 2^1 =  0
0 0 1, 1 0  ->  1 * 2^1 =  2
0 1 0, 1 0  ->  2 * 2^1 =  4
0 1 1, 1 0  ->  3 * 2^1 =  6
1 0 0, 1 0  ->  4 * 2^1 =  8
1 0 1, 1 0  ->  5 * 2^1 = 10
1 1 0, 1 0  ->  6 * 2^1 = 12
1 1 1, 1 0  ->  7 * 2^1 = 14

0 0 0, 1 1  ->  0 * 2^2 =  0
0 0 1, 1 1  ->  1 * 2^2 =  4
0 1 0, 1 1  ->  2 * 2^2 =  8
0 1 1, 1 1  ->  3 * 2^2 = 12
1 0 0, 1 1  ->  4 * 2^2 = 16
1 0 1, 1 1  ->  5 * 2^2 = 20
1 1 0, 1 1  ->  6 * 2^2 = 24
1 1 1, 1 1  ->  7 * 2^2 = 28

Notice that we have the following 20 numbers represented in this encoding.

0, 1/2, 1, 3/2, 2, 5/2, 3, 7/2, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28.

As before, we have a lot of wasted, redundant encodings. We can fix this, as before, by normalizing the significands.

0 0 0, 0 0  ->   8 * 2^(-1) = 4      // using normalized significands
0 0 1, 0 0  ->   9 * 2^(-1) = 9/2
0 1 0, 0 0  ->  10 * 2^(-1) = 5
0 1 1, 0 0  ->  11 * 2^(-1) = 11/2
1 0 0, 0 0  ->  12 * 2^(-1) = 6
1 0 1, 0 0  ->  13 * 2^(-1) = 13/2
1 1 0, 0 0  ->  14 * 2^(-1) = 7
1 1 1, 0 0  ->  15 * 2^(-1) = 15/2

0 0 0, 0 1  ->   8 * 2^0 =  8
0 0 1, 0 1  ->   9 * 2^0 =  9
0 1 0, 0 1  ->  10 * 2^0 = 10
0 1 1, 0 1  ->  11 * 2^0 = 11
1 0 0, 0 1  ->  12 * 2^0 = 12
1 0 1, 0 1  ->  13 * 2^0 = 13
1 1 0, 0 1  ->  14 * 2^0 = 14
1 1 1, 0 1  ->  15 * 2^0 = 15

0 0 0, 1 0  ->   8 * 2^1 = 16
0 0 1, 1 0  ->   9 * 2^1 = 18
0 1 0, 1 0  ->  10 * 2^1 = 20
0 1 1, 1 0  ->  11 * 2^1 = 22
1 0 0, 1 0  ->  12 * 2^1 = 24
1 0 1, 1 0  ->  13 * 2^1 = 26
1 1 0, 1 0  ->  14 * 2^1 = 28
1 1 1, 1 0  ->  15 * 2^1 = 30

0 0 0, 1 1  ->   8 * 2^2 = 32
0 0 1, 1 1  ->   9 * 2^2 = 36
0 1 0, 1 1  ->  10 * 2^2 = 40
0 1 1, 1 1  ->  11 * 2^2 = 44
1 0 0, 1 1  ->  12 * 2^2 = 48
1 0 1, 1 1  ->  13 * 2^2 = 52
1 1 0, 1 1  ->  14 * 2^2 = 56
1 1 1, 1 1  ->  15 * 2^2 = 60

We have the following 32 numbers represented in this encoding.

4, 9/2, 5, 11/2, 6, 13/2, 7, 15/2, 8, 9, 10, 11, 12, 13, 14, 15,
16, 18, 20, 22, 24, 26, 28, 30, 32, 36, 40, 44, 48, 52, 56, 60.

As before, the normalization has removed the redundant encodings, but it has left us with no encoding for 0 (or for small numbers). So, as before, let us denormalize the significands with a (biased) exponent of 0.

0 0 0, 0 0  ->   0 * 2^(-1) = 0     // denormalized significands
0 0 1, 0 0  ->   1 * 2^(-1) = 1/2
0 1 0, 0 0  ->   2 * 2^(-1) = 1
0 1 1, 0 0  ->   3 * 2^(-1) = 3/2
1 0 0, 0 0  ->   4 * 2^(-1) = 2
1 0 1, 0 0  ->   5 * 2^(-1) = 5/2
1 1 0, 0 0  ->   6 * 2^(-1) = 3
1 1 1, 0 0  ->   7 * 2^(-1) = 7/2

0 0 0, 0 1  ->   8 * 2^0 =  8       // normalized significands
0 0 1, 0 1  ->   9 * 2^0 =  9
0 1 0, 0 1  ->  10 * 2^0 = 10
0 1 1, 0 1  ->  11 * 2^0 = 11
1 0 0, 0 1  ->  12 * 2^0 = 12
1 0 1, 0 1  ->  13 * 2^0 = 13
1 1 0, 0 1  ->  14 * 2^0 = 14
1 1 1, 0 1  ->  15 * 2^0 = 15

0 0 0, 1 0  ->   8 * 2^1 = 16
0 0 1, 1 0  ->   9 * 2^1 = 18
0 1 0, 1 0  ->  10 * 2^1 = 20
0 1 1, 1 0  ->  11 * 2^1 = 22
1 0 0, 1 0  ->  12 * 2^1 = 24
1 0 1, 1 0  ->  13 * 2^1 = 26
1 1 0, 1 0  ->  14 * 2^1 = 28
1 1 1, 1 0  ->  15 * 2^1 = 30

0 0 0, 1 1  ->   8 * 2^2 = 32
0 0 1, 1 1  ->   9 * 2^2 = 36
0 1 0, 1 1  ->  10 * 2^2 = 40
0 1 1, 1 1  ->  11 * 2^2 = 44
1 0 0, 1 1  ->  12 * 2^2 = 48
1 0 1, 1 1  ->  13 * 2^2 = 52
1 1 0, 1 1  ->  14 * 2^2 = 56
1 1 1, 1 1  ->  15 * 2^2 = 60

We have the following 32 numbers represented in this encoding.

0, 1/2, 1, 3/2, 2, 5/2, 3, 7/2, 8, 9, 10, 11, 12, 13, 14, 15,
16, 18, 20, 22, 24, 26, 28, 30, 32, 36, 40, 44, 48, 52, 56, 60.

Notice that the negative exponents did allow us to encode a few fractional numbers, but not too many. Let us now try the other method for encoding fractional numbers, moving the binary point in the significand.

Exercise: Redo the above encoding but using a slightly different biased number format in the exponents. The above encoding uses a bias of 1 in the exponents. Try using a bias of 2, so the (biased) bit patterns 00, 01, 10, 11 represent (unbiased) exponents of -2, -1, 0 and 1, respectively.

Let us explain more carefully what it means to move the location of the "binary point" in the significand. The binary point is analogous to the decimal point in a base 10 number like 1.25. In a base 10 number, the decimal point separates those place values that are nonnegative powers of 10 from those place values that are negative powers of 10. In a binary number like 1010.01, the binary point separates the place values that are nonnegative powers of 2 from the place values that are negative powers of 2. In all of the significands that we defined above, the binary point was just to the right of the LSB of the significand, so all of our significands represented integers. But we can place the binary point anywhere we want in the significand. A common choice is to use normalized significands with the binary point between the implied MSB and the real MSB of the significand. With such a binary point the significand itself is a fractional number.
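Python's fractions module makes it easy to experiment with these fractional significands exactly. Here is a sketch of the normalized case just described, with the binary point right after the implied leading 1 (the helper name is my own):

from fractions import Fraction

def normalized_significand(sig_bits):
    # Value of the 4 bit significand 1.sss, i.e. 1 + sss/8.
    return 1 + Fraction(sig_bits, 8)

print(normalized_significand(0b101))   # 13/8, i.e. binary 1.101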

The following encoding uses normalized significands with the binary point just to the right of the implied MSB (so the displayed digits of the significands all represent fractional place values). The exponents in this encoding are all nonnegative (there is no bias).

0 0 0, 0 0  ->   8/8 * 2^0 = 1     // normalized significands
0 0 1, 0 0  ->   9/8 * 2^0 = 9/8
0 1 0, 0 0  ->  10/8 * 2^0 = 5/4
0 1 1, 0 0  ->  11/8 * 2^0 = 11/8
1 0 0, 0 0  ->  12/8 * 2^0 = 3/2
1 0 1, 0 0  ->  13/8 * 2^0 = 13/8
1 1 0, 0 0  ->  14/8 * 2^0 = 7/4
1 1 1, 0 0  ->  15/8 * 2^0 = 15/8

0 0 0, 0 1  ->   8/8 * 2^1 = 2
0 0 1, 0 1  ->   9/8 * 2^1 = 9/4
0 1 0, 0 1  ->  10/8 * 2^1 = 5/2
0 1 1, 0 1  ->  11/8 * 2^1 = 11/4
1 0 0, 0 1  ->  12/8 * 2^1 = 3
1 0 1, 0 1  ->  13/8 * 2^1 = 13/4
1 1 0, 0 1  ->  14/8 * 2^1 = 7/2
1 1 1, 0 1  ->  15/8 * 2^1 = 15/4

0 0 0, 1 0  ->   8/8 * 2^2 = 4
0 0 1, 1 0  ->   9/8 * 2^2 = 9/2
0 1 0, 1 0  ->  10/8 * 2^2 = 5
0 1 1, 1 0  ->  11/8 * 2^2 = 11/2
1 0 0, 1 0  ->  12/8 * 2^2 = 6
1 0 1, 1 0  ->  13/8 * 2^2 = 13/2
1 1 0, 1 0  ->  14/8 * 2^2 = 7
1 1 1, 1 0  ->  15/8 * 2^2 = 15/2

0 0 0, 1 1  ->   8/8 * 2^3 = 8
0 0 1, 1 1  ->   9/8 * 2^3 = 9
0 1 0, 1 1  ->  10/8 * 2^3 = 10
0 1 1, 1 1  ->  11/8 * 2^3 = 11
1 0 0, 1 1  ->  12/8 * 2^3 = 12
1 0 1, 1 1  ->  13/8 * 2^3 = 13
1 1 0, 1 1  ->  14/8 * 2^3 = 14
1 1 1, 1 1  ->  15/8 * 2^3 = 15

The above encoding does not encode 0 and it does not encode any small numbers. We fix that by denormalizing the significands with 0 exponent. But we keep the binary point just to the left of the MSB of the significand.

0 0 0, 0 0  ->     0 * 2^0 = 0      // denormalized significands
0 0 1, 0 0  ->   1/8 * 2^0 = 1/8
0 1 0, 0 0  ->   2/8 * 2^0 = 1/4
0 1 1, 0 0  ->   3/8 * 2^0 = 3/8
1 0 0, 0 0  ->   4/8 * 2^0 = 1/2
1 0 1, 0 0  ->   5/8 * 2^0 = 5/8
1 1 0, 0 0  ->   6/8 * 2^0 = 3/4
1 1 1, 0 0  ->   7/8 * 2^0 = 7/8

0 0 0, 0 1  ->   8/8 * 2^1 = 2      // normalized significands
0 0 1, 0 1  ->   9/8 * 2^1 = 9/4
0 1 0, 0 1  ->  10/8 * 2^1 = 5/2
0 1 1, 0 1  ->  11/8 * 2^1 = 11/4
1 0 0, 0 1  ->  12/8 * 2^1 = 3
1 0 1, 0 1  ->  13/8 * 2^1 = 13/4
1 1 0, 0 1  ->  14/8 * 2^1 = 7/2
1 1 1, 0 1  ->  15/8 * 2^1 = 15/4

0 0 0, 1 0  ->   8/8 * 2^2 = 4
0 0 1, 1 0  ->   9/8 * 2^2 = 9/2
0 1 0, 1 0  ->  10/8 * 2^2 = 5
0 1 1, 1 0  ->  11/8 * 2^2 = 11/2
1 0 0, 1 0  ->  12/8 * 2^2 = 6
1 0 1, 1 0  ->  13/8 * 2^2 = 13/2
1 1 0, 1 0  ->  14/8 * 2^2 = 7
1 1 1, 1 0  ->  15/8 * 2^2 = 15/2

0 0 0, 1 1  ->   8/8 * 2^3 = 8
0 0 1, 1 1  ->   9/8 * 2^3 = 9
0 1 0, 1 1  ->  10/8 * 2^3 = 10
0 1 1, 1 1  ->  11/8 * 2^3 = 11
1 0 0, 1 1  ->  12/8 * 2^3 = 12
1 0 1, 1 1  ->  13/8 * 2^3 = 13
1 1 0, 1 1  ->  14/8 * 2^3 = 14
1 1 1, 1 1  ->  15/8 * 2^3 = 15

Notice that we now have a fairly large gap between the largest denormalized number (7/8) and the smallest normalized number (2). If we combine our two methods for introducing fractional numbers, we can reduce the size of this gap and get a more smoothly transitioning number system.

Our next encoding uses all of the ideas we have presented above. It has negative exponents (using a biased number format), normalized significands with the binary point to the right of the (implied) MSB of the significand (so the significands represent fractions), and denormalized significands (when the biased exponent is 0) with the binary point to the left of the MSB of the significand.

0 0 0, 0 0  ->     0 * 2^(-1) = 0      // denormalized significands
0 0 1, 0 0  ->   1/8 * 2^(-1) = 1/16
0 1 0, 0 0  ->   2/8 * 2^(-1) = 1/8
0 1 1, 0 0  ->   3/8 * 2^(-1) = 3/16
1 0 0, 0 0  ->   4/8 * 2^(-1) = 1/4
1 0 1, 0 0  ->   5/8 * 2^(-1) = 5/16
1 1 0, 0 0  ->   6/8 * 2^(-1) = 3/8
1 1 1, 0 0  ->   7/8 * 2^(-1) = 7/16

0 0 0, 0 1  ->   8/8 * 2^0 = 1         // normalized significands
0 0 1, 0 1  ->   9/8 * 2^0 = 9/8
0 1 0, 0 1  ->  10/8 * 2^0 = 5/4
0 1 1, 0 1  ->  11/8 * 2^0 = 11/8
1 0 0, 0 1  ->  12/8 * 2^0 = 3/2
1 0 1, 0 1  ->  13/8 * 2^0 = 13/8
1 1 0, 0 1  ->  14/8 * 2^0 = 7/4
1 1 1, 0 1  ->  15/8 * 2^0 = 15/8

0 0 0, 1 0  ->   8/8 * 2^1 = 2
0 0 1, 1 0  ->   9/8 * 2^1 = 9/4
0 1 0, 1 0  ->  10/8 * 2^1 = 5/2
0 1 1, 1 0  ->  11/8 * 2^1 = 11/4
1 0 0, 1 0  ->  12/8 * 2^1 = 3
1 0 1, 1 0  ->  13/8 * 2^1 = 13/4
1 1 0, 1 0  ->  14/8 * 2^1 = 7/2
1 1 1, 1 0  ->  15/8 * 2^1 = 15/4

0 0 0, 1 1  ->   8/8 * 2^2 = 4
0 0 1, 1 1  ->   9/8 * 2^2 = 9/2
0 1 0, 1 1  ->  10/8 * 2^2 = 5
0 1 1, 1 1  ->  11/8 * 2^2 = 11/2
1 0 0, 1 1  ->  12/8 * 2^2 = 6
1 0 1, 1 1  ->  13/8 * 2^2 = 13/2
1 1 0, 1 1  ->  14/8 * 2^2 = 7
1 1 1, 1 1  ->  15/8 * 2^2 = 15/2

But we still have a gap between the largest denormalized number (7/16) and the smallest normalized number (1). We fix this by changing the way that we denormalize the numbers with 0 (biased) exponent. There are two ways to think about what we shall do. The first way is to think of the binary point as being to the right of the MSB (instead of to the left of the MSB). The other way is to change the (unbiased) exponent from -bias to 1-bias (recall that our bias is 1). That is the way that the CS:APP textbook describes this encoding on page 84.
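The two descriptions really do agree; here is a quick sketch checking that s.ss * 2^(-bias) equals 0.sss * 2^(1-bias) for every 3 bit significand pattern:

from fractions import Fraction

BIAS = 1
for sig in range(8):
    first  = Fraction(sig, 4) / 2**BIAS        # s.ss * 2^(-bias)
    second = Fraction(sig, 8) * 2**(1 - BIAS)  # 0.sss * 2^(1-bias)
    assert first == second == Fraction(sig, 8)
print("both interpretations agree")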

0 0 0, 0 0  ->     0 * 2^(-1) =   0 * 2^0 = 0      // denormalized significands with
0 0 1, 0 0  ->   1/4 * 2^(-1) = 1/8 * 2^0 = 1/8    // binary point to right of MSB
0 1 0, 0 0  ->   2/4 * 2^(-1) = 2/8 * 2^0 = 1/4
0 1 1, 0 0  ->   3/4 * 2^(-1) = 3/8 * 2^0 = 3/8
1 0 0, 0 0  ->   4/4 * 2^(-1) = 4/8 * 2^0 = 1/2
1 0 1, 0 0  ->   5/4 * 2^(-1) = 5/8 * 2^0 = 5/8
1 1 0, 0 0  ->   6/4 * 2^(-1) = 6/8 * 2^0 = 3/4
1 1 1, 0 0  ->   7/4 * 2^(-1) = 7/8 * 2^0 = 7/8

0 0 0, 0 1  ->   8/8 * 2^0 = 1                     // normalized significands
0 0 1, 0 1  ->   9/8 * 2^0 = 9/8
0 1 0, 0 1  ->  10/8 * 2^0 = 5/4
0 1 1, 0 1  ->  11/8 * 2^0 = 11/8
1 0 0, 0 1  ->  12/8 * 2^0 = 3/2
1 0 1, 0 1  ->  13/8 * 2^0 = 13/8
1 1 0, 0 1  ->  14/8 * 2^0 = 7/4
1 1 1, 0 1  ->  15/8 * 2^0 = 15/8

0 0 0, 1 0  ->   8/8 * 2^1 = 2
0 0 1, 1 0  ->   9/8 * 2^1 = 9/4
0 1 0, 1 0  ->  10/8 * 2^1 = 5/2
0 1 1, 1 0  ->  11/8 * 2^1 = 11/4
1 0 0, 1 0  ->  12/8 * 2^1 = 3
1 0 1, 1 0  ->  13/8 * 2^1 = 13/4
1 1 0, 1 0  ->  14/8 * 2^1 = 7/2
1 1 1, 1 0  ->  15/8 * 2^1 = 15/4

0 0 0, 1 1  ->   8/8 * 2^2 = 4
0 0 1, 1 1  ->   9/8 * 2^2 = 9/2
0 1 0, 1 1  ->  10/8 * 2^2 = 5
0 1 1, 1 1  ->  11/8 * 2^2 = 11/2
1 0 0, 1 1  ->  12/8 * 2^2 = 6
1 0 1, 1 1  ->  13/8 * 2^2 = 13/2
1 1 0, 1 1  ->  14/8 * 2^2 = 7
1 1 1, 1 1  ->  15/8 * 2^2 = 15/2

The change that we made in the interpretation of denormalized numbers has removed the gap that was between the denormalized and normalized numbers. The gap was removed by "spreading out" the eight denormalized numbers so that they uniformly cover the range from 0 to 1. Spreading out the denormalized numbers gives our encoding a property that is referred to as "gradual underflow" (see CS:APP, page 84). Notice how we now have a very smoothly changing number system. The encoded numbers between 0 and 2 are separated by gaps of size 1/8, the encoded numbers between 2 and 4 are separated by gaps of size 1/4, and the encoded numbers between 4 and 8 are separated by gaps of size 1/2.
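Putting all the pieces together, here is a sketch of a decoder for this miniature IEEE-like format (3 significand bits, 2 exponent bits, bias 1, no sign bit); it reproduces the table above exactly (the function name is my own):

from fractions import Fraction

BIAS = 1

def decode_fp(word):
    # word is a 5 bit code in the sss,ee layout used above.
    sig = (word >> 2) & 0b111
    exp = word & 0b11
    if exp == 0:                                    # denormalized: 0.sss * 2^(1-bias)
        return Fraction(sig, 8) * 2**(1 - BIAS)
    return (1 + Fraction(sig, 8)) * 2**(exp - BIAS) # normalized: 1.sss * 2^(e-bias)

print(sorted(decode_fp(w) for w in range(32)))
# 0, 1/8, 1/4, ..., 7/8, 1, 9/8, ..., 15/8, 2, 9/4, ..., 15/2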

The above encoding has almost all of the characteristics of the IEEE floating point number system. Aside from having only a 3 bit significand and a 2 bit exponent, the only other differences are that the above encoding does not have a sign bit and that it stores the significand and exponent in the wrong order. Adding a sign bit to the encoding is a trivial step (the only interesting feature of the sign bit is that it gives us both a plus and a minus zero). What is more interesting is to consider the order of the significand and the exponent in the encoding. In fact, consideration of this order will solve a slight mystery that we have so far avoided: why are we using a biased number system in the exponents (to represent negative numbers) instead of two's complement?

In all of the above tables we have written the binary encodings with the three bits of the significand to the left of the two bits of the exponent. We wrote them this way by analogy to the way that floating point literals are written in C; e.g., 23e4 means 23*10^4. In a C floating point literal we write the significand first, followed by the exponent (with the character 'e' between them). So the notation 101,01 in the above tables was meant to make it a bit easier to remember that we are working with a floating point notation: 101,01 encodes the number 1.101*2^0 (the biased exponent field 01 represents the unbiased exponent 0). So the order in which we wrote the significand and the exponent was purely for our convenience. Let us look at what happens if we change this order and put the exponent to the left of the significand (as it is in the IEEE floating point format). The following table is the same as the previous table but with the two bits of the exponent to the left of the three bits of the significand.

0 0 0 0 0  ->     0 * 2^(-1) =   0 * 2^0 = 0      // denormalized significands with
0 0 0 0 1  ->   1/4 * 2^(-1) = 1/8 * 2^0 = 1/8    // binary point to right of MSB
0 0 0 1 0  ->   2/4 * 2^(-1) = 2/8 * 2^0 = 1/4
0 0 0 1 1  ->   3/4 * 2^(-1) = 3/8 * 2^0 = 3/8
0 0 1 0 0  ->   4/4 * 2^(-1) = 4/8 * 2^0 = 1/2
0 0 1 0 1  ->   5/4 * 2^(-1) = 5/8 * 2^0 = 5/8
0 0 1 1 0  ->   6/4 * 2^(-1) = 6/8 * 2^0 = 3/4
0 0 1 1 1  ->   7/4 * 2^(-1) = 7/8 * 2^0 = 7/8

0 1 0 0 0  ->   8/8 * 2^0 = 1                     // normalized significands
0 1 0 0 1  ->   9/8 * 2^0 = 9/8
0 1 0 1 0  ->  10/8 * 2^0 = 5/4
0 1 0 1 1  ->  11/8 * 2^0 = 11/8
0 1 1 0 0  ->  12/8 * 2^0 = 3/2
0 1 1 0 1  ->  13/8 * 2^0 = 13/8
0 1 1 1 0  ->  14/8 * 2^0 = 7/4
0 1 1 1 1  ->  15/8 * 2^0 = 15/8

1 0 0 0 0  ->   8/8 * 2^1 = 2
1 0 0 0 1  ->   9/8 * 2^1 = 9/4
1 0 0 1 0  ->  10/8 * 2^1 = 5/2
1 0 0 1 1  ->  11/8 * 2^1 = 11/4
1 0 1 0 0  ->  12/8 * 2^1 = 3
1 0 1 0 1  ->  13/8 * 2^1 = 13/4
1 0 1 1 0  ->  14/8 * 2^1 = 7/2
1 0 1 1 1  ->  15/8 * 2^1 = 15/4

1 1 0 0 0  ->   8/8 * 2^2 = 4
1 1 0 0 1  ->   9/8 * 2^2 = 9/2
1 1 0 1 0  ->  10/8 * 2^2 = 5
1 1 0 1 1  ->  11/8 * 2^2 = 11/2
1 1 1 0 0  ->  12/8 * 2^2 = 6
1 1 1 0 1  ->  13/8 * 2^2 = 13/2
1 1 1 1 0  ->  14/8 * 2^2 = 7
1 1 1 1 1  ->  15/8 * 2^2 = 15/2

Notice something tremendously elegant and useful. The 32 binary encodings are in ascending order in two senses: they are in ascending order as numbers (the encoded values in the right-hand column of the table), and they are in ascending order as unsigned binary words (the bit patterns in the left-hand column of the table). One consequence is that the relational operators, < and >, for floating point numbers can be implemented using integer comparisons (see CS:APP, pages 86-87). There is no need to build special floating point comparison instructions into a computer's CPU; the integer comparison instructions can be used instead.
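This property is easy to verify in code: run through the 32 words in plain unsigned order and check that the decoded values come out ascending as well (a sketch, using the same bias 1 format but with the exponent field now in the high bits):

from fractions import Fraction

BIAS = 1

def decode_exp_first(word):
    # word is a 5 bit code in the ee,sss layout (exponent field on the left).
    exp = (word >> 3) & 0b11
    sig = word & 0b111
    if exp == 0:
        return Fraction(sig, 8)                      # denormalized
    return (1 + Fraction(sig, 8)) * 2**(exp - BIAS)  # normalized

values = [decode_exp_first(w) for w in range(32)]    # words in unsigned order
assert values == sorted(values)                      # numeric order matches
print("integer order agrees with numeric order")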

This is a very clever piece of engineering. Notice that getting the numeric and binary orderings to agree takes a combination of two distinct ideas. First, we need to use a biased encoding of the exponent field. Second, we need to put the exponent to the left of the significand in the encoding. As an exercise, rework the above encoding using 2-bit two's-complement exponents, so 10 would represent -2, 11 would represent -1, 00 would represent 0, and 01 would represent 1 (notice that this is in fact a different set of exponents than what we used above, and you will have to denormalize the numbers with exponent 10, not exponent 00). If you try this you will see that there is no way to get both orderings to agree.

