Most computer architecture books describe the IEEE floating-point number format in detail, but few try to motivate how that format was arrived at. On this page we use a "floating-point"-like format for integers as a way to motivate why the floating-point format, with its normalized and denormalized numbers and biased exponents, is structured the way it is.
Here is a problem that motivates some of what we will do. Suppose we have just 5 bits to use to represent positive integers. We can represent at most 32 different positive integers using 5 bits. The usual way to represent positive integers using 5 bits is a 5 bit unsigned integer format, which encodes the integers from 0 to 31. But suppose we want to encode a wider range of numbers than 0 to 31. That is the problem we want to explore: using 5 bits to represent a wider range of integers than the usual "fixed-point" unsigned integer format allows (we will explain soon what we mean by "fixed-point" vs. "floating-point").
To get a sense of what we will do, consider the number 12,300,000,000,000. It took 14 digits (not counting the "punctuation marks", the commas) to represent that number. But there is an alternative notation for such large numbers. We can write this number, using "scientific notation", as, for example, 123e11, which represents 123*10^(11). So by making use of an exponential notation we can write a 14 digit number using just 5 digits (not counting the "punctuation mark", the e). Notice that this 5 digit format can represent a number as large as 999e99, which has 102 digits if written in the usual decimal format. So five digits, used in an exponential format, can represent numbers of far more than 5 places in the usual decimal format.
Notice one thing though. Not all large integers can be represented by fewer digits when we use scientific notation. The number 12,356,789,876,543 has fourteen digits, and scientific notation is not going to make this number any easier to write. But if we are willing to approximate this number, then the number 124e11 can be considered a reasonable substitute for the original number. In fact, 124e11 will be the "reasonable" approximation for a lot of 14 digit integers (can you determine exactly how many?). We say that the original number 12,356,789,876,543 has 14 significant digits and its approximation 124e11 has only three significant digits.
Now let us create a binary encoding of positive integers that mimics "scientific notation" so that we can encode large integers in a small number of bits. In particular, we shall create a 5 bit encoding that encodes integers from a far larger range than the 0 to 31 range determined by the unsigned integer format.
Let the notation 101e11 be a binary scientific notation for 5*2^(3)=40. The binary number before the e is called the significand and the binary number after the e is called the exponent. Let us use this notation to define an encoding of 5 bit words into positive integers. For any given 5 bit word, let the three most significant bits be the significand and the two least significant bits be the exponent. Below is a table showing how this encoding maps 5 bit words to integers. We write the 5 bit words with a comma in them to help distinguish the significand from the exponent. (Notice that we are using 3 bit and 2 bit unsigned integer formats in the significand and exponent, respectively.)
0 0 0, 0 0  ->  0 * 2^0 =  0
0 0 1, 0 0  ->  1 * 2^0 =  1
0 1 0, 0 0  ->  2 * 2^0 =  2
0 1 1, 0 0  ->  3 * 2^0 =  3
1 0 0, 0 0  ->  4 * 2^0 =  4
1 0 1, 0 0  ->  5 * 2^0 =  5
1 1 0, 0 0  ->  6 * 2^0 =  6
1 1 1, 0 0  ->  7 * 2^0 =  7
0 0 0, 0 1  ->  0 * 2^1 =  0
0 0 1, 0 1  ->  1 * 2^1 =  2
0 1 0, 0 1  ->  2 * 2^1 =  4
0 1 1, 0 1  ->  3 * 2^1 =  6
1 0 0, 0 1  ->  4 * 2^1 =  8
1 0 1, 0 1  ->  5 * 2^1 = 10
1 1 0, 0 1  ->  6 * 2^1 = 12
1 1 1, 0 1  ->  7 * 2^1 = 14
0 0 0, 1 0  ->  0 * 2^2 =  0
0 0 1, 1 0  ->  1 * 2^2 =  4
0 1 0, 1 0  ->  2 * 2^2 =  8
0 1 1, 1 0  ->  3 * 2^2 = 12
1 0 0, 1 0  ->  4 * 2^2 = 16
1 0 1, 1 0  ->  5 * 2^2 = 20
1 1 0, 1 0  ->  6 * 2^2 = 24
1 1 1, 1 0  ->  7 * 2^2 = 28
0 0 0, 1 1  ->  0 * 2^3 =  0
0 0 1, 1 1  ->  1 * 2^3 =  8
0 1 0, 1 1  ->  2 * 2^3 = 16
0 1 1, 1 1  ->  3 * 2^3 = 24
1 0 0, 1 1  ->  4 * 2^3 = 32
1 0 1, 1 1  ->  5 * 2^3 = 40
1 1 0, 1 1  ->  6 * 2^3 = 48
1 1 1, 1 1  ->  7 * 2^3 = 56
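If you find code easier to read than the table, here is a small Python sketch (the function name is just illustrative) that enumerates this un-normalized encoding and reproduces the table above.

    # Enumerate the un-normalized 5 bit encoding: a 3 bit unsigned significand
    # and a 2 bit unsigned exponent, with value = significand * 2^exponent.
    def unnormalized_value(significand, exponent):
        return significand * 2**exponent

    for exponent in range(4):          # 2 exponent bits: 0..3
        for significand in range(8):   # 3 significand bits: 0..7
            print(f"{significand:03b},{exponent:02b} -> "
                  f"{significand} * 2^{exponent} = {unnormalized_value(significand, exponent)}")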
Notice that the following 20 integers have representations in this encoding.
0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32, 40, 48, 56
On the one hand, we do have an encoding that encodes a larger range of integers (0 to 56) than the usual range of 0 to 31 given by the 5 bit unsigned integer format. But instead of 5 bits being able to represent 32 integers, we can represent only 20 integers.
We have 12 wasted encodings since there are
4 representations of 0,
2 representations of 2,
3 representations of 4,
2 representations of 6,
3 representations of 8,
2 representations of 12,
2 representations of 16,
2 representations of 24.
We can get rid of the redundant encodings by normalizing the significands. The idea is that every non-zero significand has a 1 in it somewhere, so we shift the significand to the left until there is a 1 in the leftmost place (the "most significant bit" (MSB) of the significand), and every time we shift the significand left we subtract one from the exponent. For example, the number 011,10 is normalized to 110,01. Notice that since every normalized significand has a 1 in its MSB, we can drop that MSB from the encoding and not store it (we know that it is there). But instead of using just two bits in the encoded significand, we will still use three bits, so the actual (normalized) significand will be 4 bits. So the un-normalized number 011,10 is normalized to 100,00: we shift the significand twice so that the leading 1 lands in the fourth bit (which we drop from the encoding) and then subtract 2 from the exponent to get 0. Let us see how this normalization gets rid of the redundant encodings. In the un-normalized encoding above, the integer 8 has three representations. Let us normalize each of them. The first one, 100,01, normalizes with one left shift to 000,00. The second, 010,10, normalizes with two left shifts to 000,00. And the third, 001,11, normalizes with three left shifts to 000,00. So all three un-normalized representations of the integer 8 normalize to the same encoding, 000,00.
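Here is a sketch of that normalization step in Python (the helper is ours, just to make the shifting explicit); it takes an un-normalized significand and exponent from the table above and returns the three stored significand bits and the new exponent.

    # Normalize a non-zero (significand, exponent) pair from the un-normalized
    # encoding: shift the significand left until its leading 1 sits in the
    # (hidden) fourth bit, subtracting 1 from the exponent for each shift.
    # Returns the 3 stored significand bits and the exponent, or None if the
    # exponent would go below 0.
    def normalize(significand, exponent):
        assert significand != 0
        while significand < 0b1000:           # until the leading 1 is in bit 3
            significand <<= 1
            exponent -= 1
        if exponent < 0:
            return None                       # too small to normalize in this format
        return significand & 0b111, exponent  # drop the hidden leading 1

    print(normalize(0b011, 0b10))   # 12: (4, 0), i.e. stored as 100,00
    print(normalize(0b100, 0b01))   # 8:  (0, 0), i.e. stored as 000,00
    print(normalize(0b010, 0b10))   # 8:  (0, 0)
    print(normalize(0b001, 0b11))   # 8:  (0, 0)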
In practice, we "normalize" the significands in our encoding by pretending that there is an extra 1 to the left of the significand's most significant bit.
Here are all the encodings using normalized significands. Remember that each significand now has an implied most significant bit of 1 (which is not shown in the table).
0 0 0, 0 0  ->   8 * 2^0 =   8     // using normalized significands
0 0 1, 0 0  ->   9 * 2^0 =   9
0 1 0, 0 0  ->  10 * 2^0 =  10
0 1 1, 0 0  ->  11 * 2^0 =  11
1 0 0, 0 0  ->  12 * 2^0 =  12
1 0 1, 0 0  ->  13 * 2^0 =  13
1 1 0, 0 0  ->  14 * 2^0 =  14
1 1 1, 0 0  ->  15 * 2^0 =  15
0 0 0, 0 1  ->   8 * 2^1 =  16
0 0 1, 0 1  ->   9 * 2^1 =  18
0 1 0, 0 1  ->  10 * 2^1 =  20
0 1 1, 0 1  ->  11 * 2^1 =  22
1 0 0, 0 1  ->  12 * 2^1 =  24
1 0 1, 0 1  ->  13 * 2^1 =  26
1 1 0, 0 1  ->  14 * 2^1 =  28
1 1 1, 0 1  ->  15 * 2^1 =  30
0 0 0, 1 0  ->   8 * 2^2 =  32
0 0 1, 1 0  ->   9 * 2^2 =  36
0 1 0, 1 0  ->  10 * 2^2 =  40
0 1 1, 1 0  ->  11 * 2^2 =  44
1 0 0, 1 0  ->  12 * 2^2 =  48
1 0 1, 1 0  ->  13 * 2^2 =  52
1 1 0, 1 0  ->  14 * 2^2 =  56
1 1 1, 1 0  ->  15 * 2^2 =  60
0 0 0, 1 1  ->   8 * 2^3 =  64
0 0 1, 1 1  ->   9 * 2^3 =  72
0 1 0, 1 1  ->  10 * 2^3 =  80
0 1 1, 1 1  ->  11 * 2^3 =  88
1 0 0, 1 1  ->  12 * 2^3 =  96
1 0 1, 1 1  ->  13 * 2^3 = 104
1 1 0, 1 1  ->  14 * 2^3 = 112
1 1 1, 1 1  ->  15 * 2^3 = 120
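The decoding rule for this table is short when written as code. Here is a sketch: prepend the implied 1 to the three stored significand bits, then multiply by 2 raised to the exponent.

    # Decode the normalized encoding: the stored 3 bit significand gains an
    # implied leading 1, so the actual significand is 8..15.
    def normalized_value(stored_significand, exponent):
        return (0b1000 | stored_significand) * 2**exponent

    print(normalized_value(0b000, 0b00))   # 8, the smallest encoded integer
    print(normalized_value(0b111, 0b11))   # 120, the largest encoded integer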
Notice that the following 32 integers have representations in this encoding.
8, 9, 10, 11, 12, 13, 14, 15, 16, 18, 20, 22, 24, 26, 28, 30, 32, 36, 40, 44, 48, 52, 56, 60, 64, 72, 80, 88, 96, 104, 112, 120.
Normalizing the significands did two things. First, it eliminated the redundant encodings, so we get 32 distinct encodings out of our 5 bits. And second, since every normalized significand has a (hidden) fourth bit, we can encode a larger range of integers since we have a larger significand to work with. The previous encoding encoded 20 integers out of the range from 0 to 56 (approximately 35% of the integers in the range) and the second, normalized encoding encoded 32 integers out of the range from 8 to 120 (approximately 28% of the integers in the range). This shows that normalization does a very efficient job of increasing the range without significantly increasing the amount of rounding needed when we approximate an integer that doesn't have an encoding with one of the integers that does have an encoding.
Exercise: Work out the details of the encoding that uses 4 bits for the significand and 2 bits for the exponent and is not normalized. How many distinct integers have encodings? What is the range of the encoding? How do 4 explicit bits in the significand compare with 3 explicit bits plus 1 implicit, normalized bit?
Normalization solved one problem for us, but it created another problem. We have eliminated the redundant encodings, but the price we pay is that 0 no longer has an encoding! In fact, none of the integers less than 8 have an encoding. We fix this problem by denormalizing some of our significands. We will denormalize those significands that have 0 as an exponent. That is, for the numbers with 0 as their exponent, we will not give those numbers an implied 1 as their MSB.
Here are our encodings with denormalization of those significands with zero exponents.
0 0 0, 0 0  ->   0 * 2^0 =   0     // denormalized numbers
0 0 1, 0 0  ->   1 * 2^0 =   1
0 1 0, 0 0  ->   2 * 2^0 =   2
0 1 1, 0 0  ->   3 * 2^0 =   3
1 0 0, 0 0  ->   4 * 2^0 =   4
1 0 1, 0 0  ->   5 * 2^0 =   5
1 1 0, 0 0  ->   6 * 2^0 =   6
1 1 1, 0 0  ->   7 * 2^0 =   7
0 0 0, 0 1  ->   8 * 2^1 =  16     // normalized from here down
0 0 1, 0 1  ->   9 * 2^1 =  18     // there is an implied 1 as
0 1 0, 0 1  ->  10 * 2^1 =  20     // the MSB of each significand
0 1 1, 0 1  ->  11 * 2^1 =  22
1 0 0, 0 1  ->  12 * 2^1 =  24
1 0 1, 0 1  ->  13 * 2^1 =  26
1 1 0, 0 1  ->  14 * 2^1 =  28
1 1 1, 0 1  ->  15 * 2^1 =  30
0 0 0, 1 0  ->   8 * 2^2 =  32
0 0 1, 1 0  ->   9 * 2^2 =  36
0 1 0, 1 0  ->  10 * 2^2 =  40
0 1 1, 1 0  ->  11 * 2^2 =  44
1 0 0, 1 0  ->  12 * 2^2 =  48
1 0 1, 1 0  ->  13 * 2^2 =  52
1 1 0, 1 0  ->  14 * 2^2 =  56
1 1 1, 1 0  ->  15 * 2^2 =  60
0 0 0, 1 1  ->   8 * 2^3 =  64
0 0 1, 1 1  ->   9 * 2^3 =  72
0 1 0, 1 1  ->  10 * 2^3 =  80
0 1 1, 1 1  ->  11 * 2^3 =  88
1 0 0, 1 1  ->  12 * 2^3 =  96
1 0 1, 1 1  ->  13 * 2^3 = 104
1 1 0, 1 1  ->  14 * 2^3 = 112
1 1 1, 1 1  ->  15 * 2^3 = 120
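In code, denormalization is just a special case for a zero exponent field, along the lines of this sketch:

    # Decode the encoding with denormalized numbers: when the exponent field
    # is zero there is no implied leading 1; otherwise there is.
    def value(stored_significand, exponent):
        if exponent == 0:                                # denormalized
            return stored_significand
        return (8 + stored_significand) * 2**exponent    # normalized

    print(value(0b111, 0b00))   # 7, the largest denormalized number
    print(value(0b000, 0b01))   # 16, the smallest normalized number (note the gap)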
Notice that the following 32 integers have representations in this encoding.
0, 1, 2, 3, 4, 5, 6, 7, 16, 18, 20, 22, 24, 26, 28, 30, 32, 36, 40, 44, 48, 52, 56, 60, 64, 72, 80, 88, 96, 104, 112, 120.
Notice the big gap between the denormalized and normalized numbers. We can get rid of that gap with a trick: we can subtract a bias from the exponent of each normalized number. We will use a bias of 1, and so we will subtract 1 from each normalized number's exponent.
0 0 0, 0 0  ->   0 * 2^0 =  0     // denormalized numbers
0 0 1, 0 0  ->   1 * 2^0 =  1
0 1 0, 0 0  ->   2 * 2^0 =  2
0 1 1, 0 0  ->   3 * 2^0 =  3
1 0 0, 0 0  ->   4 * 2^0 =  4
1 0 1, 0 0  ->   5 * 2^0 =  5
1 1 0, 0 0  ->   6 * 2^0 =  6
1 1 1, 0 0  ->   7 * 2^0 =  7
0 0 0, 0 1  ->   8 * 2^0 =  8     // normalized, with an implied 1 as
0 0 1, 0 1  ->   9 * 2^0 =  9     // the MSB of each significand, and
0 1 0, 0 1  ->  10 * 2^0 = 10     // a bias of 1 subtracted from each exponent
0 1 1, 0 1  ->  11 * 2^0 = 11
1 0 0, 0 1  ->  12 * 2^0 = 12
1 0 1, 0 1  ->  13 * 2^0 = 13
1 1 0, 0 1  ->  14 * 2^0 = 14
1 1 1, 0 1  ->  15 * 2^0 = 15
0 0 0, 1 0  ->   8 * 2^1 = 16
0 0 1, 1 0  ->   9 * 2^1 = 18
0 1 0, 1 0  ->  10 * 2^1 = 20
0 1 1, 1 0  ->  11 * 2^1 = 22
1 0 0, 1 0  ->  12 * 2^1 = 24
1 0 1, 1 0  ->  13 * 2^1 = 26
1 1 0, 1 0  ->  14 * 2^1 = 28
1 1 1, 1 0  ->  15 * 2^1 = 30
0 0 0, 1 1  ->   8 * 2^2 = 32
0 0 1, 1 1  ->   9 * 2^2 = 36
0 1 0, 1 1  ->  10 * 2^2 = 40
0 1 1, 1 1  ->  11 * 2^2 = 44
1 0 0, 1 1  ->  12 * 2^2 = 48
1 0 1, 1 1  ->  13 * 2^2 = 52
1 1 0, 1 1  ->  14 * 2^2 = 56
1 1 1, 1 1  ->  15 * 2^2 = 60
The gap between the denormalized and normalized numbers is gone. We now have encodings for every integer between 0 and 16, every other integer between 16 and 32, and every fourth integer between 32 and 60. We have a very smoothly changing encoding where the smaller integers are closer together and the larger integers are further apart.
Notice that the range is not as great as it was before we used the bias. By changing the value of the bias, we can control the range of represented numbers. In the next table, we use a bias of zero on the normalized numbers and a bias of -1 on the denormalized numbers, and we get back our previous range.
0 0 0, 0 0  ->   0 * 2^1 =   0     // denormalized, with a bias of -1
0 0 1, 0 0  ->   1 * 2^1 =   2     // subtracted from each exponent
0 1 0, 0 0  ->   2 * 2^1 =   4
0 1 1, 0 0  ->   3 * 2^1 =   6
1 0 0, 0 0  ->   4 * 2^1 =   8
1 0 1, 0 0  ->   5 * 2^1 =  10
1 1 0, 0 0  ->   6 * 2^1 =  12
1 1 1, 0 0  ->   7 * 2^1 =  14
0 0 0, 0 1  ->   8 * 2^1 =  16     // normalized, with an implied 1 as
0 0 1, 0 1  ->   9 * 2^1 =  18     // the MSB of each significand, and
0 1 0, 0 1  ->  10 * 2^1 =  20     // a bias of 0 subtracted from each exponent
0 1 1, 0 1  ->  11 * 2^1 =  22
1 0 0, 0 1  ->  12 * 2^1 =  24
1 0 1, 0 1  ->  13 * 2^1 =  26
1 1 0, 0 1  ->  14 * 2^1 =  28
1 1 1, 0 1  ->  15 * 2^1 =  30
0 0 0, 1 0  ->   8 * 2^2 =  32
0 0 1, 1 0  ->   9 * 2^2 =  36
0 1 0, 1 0  ->  10 * 2^2 =  40
0 1 1, 1 0  ->  11 * 2^2 =  44
1 0 0, 1 0  ->  12 * 2^2 =  48
1 0 1, 1 0  ->  13 * 2^2 =  52
1 1 0, 1 0  ->  14 * 2^2 =  56
1 1 1, 1 0  ->  15 * 2^2 =  60
0 0 0, 1 1  ->   8 * 2^3 =  64
0 0 1, 1 1  ->   9 * 2^3 =  72
0 1 0, 1 1  ->  10 * 2^3 =  80
0 1 1, 1 1  ->  11 * 2^3 =  88
1 0 0, 1 1  ->  12 * 2^3 =  96
1 0 1, 1 1  ->  13 * 2^3 = 104
1 1 0, 1 1  ->  14 * 2^3 = 112
1 1 1, 1 1  ->  15 * 2^3 = 120
The gap between the denormalized and normalized numbers is gone and we have encodings for every other integer between 0 and 32, every fourth integer between 32 and 64, and every eighth integer between 64 and 120. We again have a very smoothly changing encoding where the smaller integers are closer together and the larger integers are further apart.
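All of these biased variants follow a single decoding rule, parameterized by the bias. Here is a sketch (the parameter names are ours); it follows this page's convention that the denormalized bias is one less than the normalized bias.

    # General decoder for the 5 bit integer format: 3 stored significand bits,
    # 2 exponent bits, an implied leading 1 for normalized numbers, and a
    # normalized bias (the denormalized bias is one less).
    def decode(stored_significand, exponent_field, norm_bias):
        denorm_bias = norm_bias - 1
        if exponent_field == 0:                                   # denormalized
            return stored_significand * 2**(0 - denorm_bias)
        return (8 + stored_significand) * 2**(exponent_field - norm_bias)

    # norm_bias=0 reproduces the table just above, norm_bias=1 the earlier
    # table that encodes every integer from 0 to 16, and norm_bias=-1 the
    # wider-range table shown next.
    print([decode(s, e, 0) for e in range(4) for s in range(8)])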
The idea of a bias is important and, as we will see below, it is used in the floating-point fractional number systems.
If we change the bias to be -2 for the denormalized numbers and -1 for the normalized numbers, then we get the following encodings.
0 0 0, 0 0  ->   0 * 2^2 =   0     // denormalized, with a bias of -2
0 0 1, 0 0  ->   1 * 2^2 =   4     // subtracted from each exponent
0 1 0, 0 0  ->   2 * 2^2 =   8
0 1 1, 0 0  ->   3 * 2^2 =  12
1 0 0, 0 0  ->   4 * 2^2 =  16
1 0 1, 0 0  ->   5 * 2^2 =  20
1 1 0, 0 0  ->   6 * 2^2 =  24
1 1 1, 0 0  ->   7 * 2^2 =  28
0 0 0, 0 1  ->   8 * 2^2 =  32     // normalized, with an implied 1 as
0 0 1, 0 1  ->   9 * 2^2 =  36     // the MSB of each significand, and
0 1 0, 0 1  ->  10 * 2^2 =  40     // a bias of -1 subtracted from each exponent
0 1 1, 0 1  ->  11 * 2^2 =  44
1 0 0, 0 1  ->  12 * 2^2 =  48
1 0 1, 0 1  ->  13 * 2^2 =  52
1 1 0, 0 1  ->  14 * 2^2 =  56
1 1 1, 0 1  ->  15 * 2^2 =  60
0 0 0, 1 0  ->   8 * 2^3 =  64
0 0 1, 1 0  ->   9 * 2^3 =  72
0 1 0, 1 0  ->  10 * 2^3 =  80
0 1 1, 1 0  ->  11 * 2^3 =  88
1 0 0, 1 0  ->  12 * 2^3 =  96
1 0 1, 1 0  ->  13 * 2^3 = 104
1 1 0, 1 0  ->  14 * 2^3 = 112
1 1 1, 1 0  ->  15 * 2^3 = 120
0 0 0, 1 1  ->   8 * 2^4 = 128
0 0 1, 1 1  ->   9 * 2^4 = 144
0 1 0, 1 1  ->  10 * 2^4 = 160
0 1 1, 1 1  ->  11 * 2^4 = 176
1 0 0, 1 1  ->  12 * 2^4 = 192
1 0 1, 1 1  ->  13 * 2^4 = 208
1 1 0, 1 1  ->  14 * 2^4 = 224
1 1 1, 1 1  ->  15 * 2^4 = 240
Notice that by changing the bias we can make the range as large as we want, but at the expense of losing precision. For example, the number 176 now represents a fairly large, imprecise, range of numbers from 168 to 184 (why 168 to 184?).
So far, we have managed to use 5 bits to represent 32 integers over a much larger range than what we would get if we used the 5-bit unsigned integer representation, which has a range from 0 to 31.
The next question we should ask is, can we come up with algorithms for adding, subtracting, multiplying and dividing integers using this normalized, "floating point" encoding scheme? Notice that these arithmetic algorithms would need to include rounding rules, since, for example, 20 + 14 = 34, but 34 does not have an encoding in our last table, so 34 would need to be "rounded" to either 32 or 36.
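We will not work out those algorithms here, but to make the rounding idea concrete, here is a sketch that rounds an exact result to the nearest value representable in the last table above (the one with a bias of -1 on the normalized numbers). Note that 34 is exactly halfway between 32 and 36; a real system needs an explicit tie-breaking rule, and this sketch simply picks the smaller value.

    # The values representable in the 5 bit format with normalized bias -1
    # (and denormalized bias -2), and round-to-nearest into that set.
    def representable_values():
        values = set()
        for e in range(4):
            for s in range(8):
                if e == 0:
                    values.add(s * 2**2)                  # denormalized
                else:
                    values.add((8 + s) * 2**(e + 1))      # normalized
        return sorted(values)

    def round_to_representable(x):
        return min(representable_values(), key=lambda v: abs(v - x))

    print(round_to_representable(20 + 14))   # 34 rounds to 32 here (36 is equally close)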
Instead of working out the arithmetic algorithms and rounding rules for our integer encodings, let us see how we can get our encodings to be more like the IEEE floating point format. That is, let us see how we can encode some fractional numbers along with our integers.
We can include fractional numbers in our encodings simply by changing the value of the bias. To get fractional numbers, we need negative exponents. Certainly negative exponents can lead to fractions (though a negative exponent does not always mean a fraction; for example, 8*2^(-2) has a negative exponent but is an integer). So if we choose a bias that gives us some negative exponents, we can represent some fractional numbers.
Let us use a bias of 1 subtracted from the denormalized exponents and a bias of 2 subtracted from the normalized exponents.
0 0 0, 0 0  ->   0 * 2^(-1) =    0     // denormalized, with a bias of 1
0 0 1, 0 0  ->   1 * 2^(-1) =  1/2     // subtracted from each exponent
0 1 0, 0 0  ->   2 * 2^(-1) =    1
0 1 1, 0 0  ->   3 * 2^(-1) =  3/2
1 0 0, 0 0  ->   4 * 2^(-1) =    2
1 0 1, 0 0  ->   5 * 2^(-1) =  5/2
1 1 0, 0 0  ->   6 * 2^(-1) =    3
1 1 1, 0 0  ->   7 * 2^(-1) =  7/2
0 0 0, 0 1  ->   8 * 2^(-1) =    4     // normalized, with an implied 1 as
0 0 1, 0 1  ->   9 * 2^(-1) =  9/2     // the MSB of each significand, and
0 1 0, 0 1  ->  10 * 2^(-1) =    5     // a bias of 2 subtracted from each exponent
0 1 1, 0 1  ->  11 * 2^(-1) = 11/2
1 0 0, 0 1  ->  12 * 2^(-1) =    6
1 0 1, 0 1  ->  13 * 2^(-1) = 13/2
1 1 0, 0 1  ->  14 * 2^(-1) =    7
1 1 1, 0 1  ->  15 * 2^(-1) = 15/2
0 0 0, 1 0  ->   8 * 2^0    =    8
0 0 1, 1 0  ->   9 * 2^0    =    9
0 1 0, 1 0  ->  10 * 2^0    =   10
0 1 1, 1 0  ->  11 * 2^0    =   11
1 0 0, 1 0  ->  12 * 2^0    =   12
1 0 1, 1 0  ->  13 * 2^0    =   13
1 1 0, 1 0  ->  14 * 2^0    =   14
1 1 1, 1 0  ->  15 * 2^0    =   15
0 0 0, 1 1  ->   8 * 2^1    =   16
0 0 1, 1 1  ->   9 * 2^1    =   18
0 1 0, 1 1  ->  10 * 2^1    =   20
0 1 1, 1 1  ->  11 * 2^1    =   22
1 0 0, 1 1  ->  12 * 2^1    =   24
1 0 1, 1 1  ->  13 * 2^1    =   26
1 1 0, 1 1  ->  14 * 2^1    =   28
1 1 1, 1 1  ->  15 * 2^1    =   30
Notice how we are now getting some fractional numbers. Not a lot of them. After 7.5, there are only integers.
Let us use a bias of 2 subtracted from the denormalized exponents and a bias of 3 subtracted from the normalized exponents.
0 0 0, 0 0  ->   0 * 2^(-2) =    0     // denormalized, with a bias of 2
0 0 1, 0 0  ->   1 * 2^(-2) =  1/4     // subtracted from each exponent
0 1 0, 0 0  ->   2 * 2^(-2) =  2/4
0 1 1, 0 0  ->   3 * 2^(-2) =  3/4
1 0 0, 0 0  ->   4 * 2^(-2) =    1
1 0 1, 0 0  ->   5 * 2^(-2) =  5/4
1 1 0, 0 0  ->   6 * 2^(-2) =  6/4
1 1 1, 0 0  ->   7 * 2^(-2) =  7/4
0 0 0, 0 1  ->   8 * 2^(-2) =    2     // normalized, with an implied 1 as
0 0 1, 0 1  ->   9 * 2^(-2) =  9/4     // the MSB of each significand, and
0 1 0, 0 1  ->  10 * 2^(-2) = 10/4     // a bias of 3 subtracted from each exponent
0 1 1, 0 1  ->  11 * 2^(-2) = 11/4
1 0 0, 0 1  ->  12 * 2^(-2) =    3
1 0 1, 0 1  ->  13 * 2^(-2) = 13/4
1 1 0, 0 1  ->  14 * 2^(-2) = 14/4
1 1 1, 0 1  ->  15 * 2^(-2) = 15/4
0 0 0, 1 0  ->   8 * 2^(-1) =    4
0 0 1, 1 0  ->   9 * 2^(-1) =  9/2
0 1 0, 1 0  ->  10 * 2^(-1) =    5
0 1 1, 1 0  ->  11 * 2^(-1) = 11/2
1 0 0, 1 0  ->  12 * 2^(-1) =    6
1 0 1, 1 0  ->  13 * 2^(-1) = 13/2
1 1 0, 1 0  ->  14 * 2^(-1) =    7
1 1 1, 1 0  ->  15 * 2^(-1) = 15/2
0 0 0, 1 1  ->   8 * 2^0    =    8
0 0 1, 1 1  ->   9 * 2^0    =    9
0 1 0, 1 1  ->  10 * 2^0    =   10
0 1 1, 1 1  ->  11 * 2^0    =   11
1 0 0, 1 1  ->  12 * 2^0    =   12
1 0 1, 1 1  ->  13 * 2^0    =   13
1 1 0, 1 1  ->  14 * 2^0    =   14
1 1 1, 1 1  ->  15 * 2^0    =   15
Now we have more fractional numbers. But we have a much smaller range (only up to 15).
If we use a bias of 3 subtracted from the denormalized exponents and a bias of 4 subtracted from the normalized exponents, then we will have even more fractional numbers, but the range will be only up to 15/2. As an exercise, you should create the table with those choices for the bias.
Believe it or not, the tables of numbers that we have described so far are almost exactly what is meant by "floating-point numbers". There are only four small differences between our last couple of tables and the "real" floating-point numbers used by the Pentium computer.
The most obvious difference is the number of bits. We have been using five bits and a Pentium uses 32 bits for "floats" and 64 bits for "doubles". But the only reason we are using five bits is so that we can write out the whole table of floating-point numbers. With 32 bits, the table would have four billion values. But you would compute each table value exactly as we have computed the values in the above tables (you just need to know how many of the 32 bits go in the significand, how many bits go in the exponent, and what the bias is).
The second difference between our tables and "real" floating point numbers is where the "binary point" is placed. We have treated the bits of the significand as always representing an integer number. That means we have placed the binary point to the right of the least significant bit of the significand. But in floating-point number systems, the binary point is to the left of the most significant bit of the significand and the implied normalization bit is to the left of the binary point. With that interpretation of where the binary point is placed, the bits of the significand always represent a fractional number between 1 and 2 for the normalized numbers (why?), and a fractional number between 0 and 1 for the denormalized numbers (why?).
Let us redo our table with the binary point to the left of the significand and with a bias of -2 subtracted from the denormalized exponents and a bias of -1 subtracted from the normalized exponents. Notice that we get the exact same table that we had earlier.
0 0 0, 0 0  ->   0/8 * 2^2 =    0     // denormalized, with a bias of -2
0 0 1, 0 0  ->   1/8 * 2^2 =  1/2     // subtracted from each exponent, and
0 1 0, 0 0  ->   2/8 * 2^2 =    1     // the binary point is to the left of the significand
0 1 1, 0 0  ->   3/8 * 2^2 =  3/2
1 0 0, 0 0  ->   4/8 * 2^2 =    2
1 0 1, 0 0  ->   5/8 * 2^2 =  5/2
1 1 0, 0 0  ->   6/8 * 2^2 =    3
1 1 1, 0 0  ->   7/8 * 2^2 =  7/2
0 0 0, 0 1  ->   8/8 * 2^2 =    4     // normalized, with an implied 1 as
0 0 1, 0 1  ->   9/8 * 2^2 =  9/2     // the MSB of each significand,
0 1 0, 0 1  ->  10/8 * 2^2 =    5     // a bias of -1 subtracted from each exponent, and
0 1 1, 0 1  ->  11/8 * 2^2 = 11/2     // the binary point is to the right of the implied MSB
1 0 0, 0 1  ->  12/8 * 2^2 =    6
1 0 1, 0 1  ->  13/8 * 2^2 = 13/2
1 1 0, 0 1  ->  14/8 * 2^2 =    7
1 1 1, 0 1  ->  15/8 * 2^2 = 15/2
0 0 0, 1 0  ->   8/8 * 2^3 =    8
0 0 1, 1 0  ->   9/8 * 2^3 =    9
0 1 0, 1 0  ->  10/8 * 2^3 =   10
0 1 1, 1 0  ->  11/8 * 2^3 =   11
1 0 0, 1 0  ->  12/8 * 2^3 =   12
1 0 1, 1 0  ->  13/8 * 2^3 =   13
1 1 0, 1 0  ->  14/8 * 2^3 =   14
1 1 1, 1 0  ->  15/8 * 2^3 =   15
0 0 0, 1 1  ->   8/8 * 2^4 =   16
0 0 1, 1 1  ->   9/8 * 2^4 =   18
0 1 0, 1 1  ->  10/8 * 2^4 =   20
0 1 1, 1 1  ->  11/8 * 2^4 =   22
1 0 0, 1 1  ->  12/8 * 2^4 =   24
1 0 1, 1 1  ->  13/8 * 2^4 =   26
1 1 0, 1 1  ->  14/8 * 2^4 =   28
1 1 1, 1 1  ->  15/8 * 2^4 =   30
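Here is a sketch of the decoding rule with the binary point to the left of the significand (Python's Fraction type keeps the fractional values exact); it reproduces the table above.

    from fractions import Fraction

    # Decode with the binary point to the left of the 3 stored significand bits:
    # denormalized values are 0.sss * 2^(0 - denorm_bias) and normalized values
    # are 1.sss * 2^(exponent - norm_bias).  Here denorm_bias=-2 and norm_bias=-1.
    def decode_fractional(stored_significand, exponent_field,
                          norm_bias=-1, denorm_bias=-2):
        if exponent_field == 0:                                    # denormalized
            return Fraction(stored_significand, 8) * Fraction(2)**(0 - denorm_bias)
        return Fraction(8 + stored_significand, 8) * Fraction(2)**(exponent_field - norm_bias)

    for e in range(4):
        for s in range(8):
            print(f"{s:03b},{e:02b} -> {decode_fractional(s, e)}")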
As an exercise, redo this table with a bias of -1 subtracted from the denormalized exponents and a bias of 0 subtracted from the normalized exponents.
Redo the table again with a bias of 0 subtracted from the denormalized exponents and a bias of 1 subtracted from the normalized exponents.
The third difference between our encodings and the ones used by the "real" floating-point number system is the order of the significand and the exponent. We have written our encodings as xxx,xx with the three bits of the significand as the higher order bits and the two bits of the exponent as the lower order bits. We did this to mimic the scientific notation used in C, which looks like 1.3e2, with the significand to the left of the e and the exponent to the right. But let us see what happens if we reverse the order of the significand and the exponent.
Here is the last table but with the two bits of the exponent in the higher order bits, and the three bits of the significand in the lower order bits. Notice that our encodings are now in the normal binary ordering! This means that we can compare our "floating-point numbers" using integer comparisons.
0 0 0 0 0  ->   0/8 * 2^2 =    0     // denormalized, with a bias of -2
0 0 0 0 1  ->   1/8 * 2^2 =  1/2     // subtracted from each exponent, and
0 0 0 1 0  ->   2/8 * 2^2 =    1     // the binary point is to the left of the significand
0 0 0 1 1  ->   3/8 * 2^2 =  3/2
0 0 1 0 0  ->   4/8 * 2^2 =    2
0 0 1 0 1  ->   5/8 * 2^2 =  5/2
0 0 1 1 0  ->   6/8 * 2^2 =    3
0 0 1 1 1  ->   7/8 * 2^2 =  7/2
0 1 0 0 0  ->   8/8 * 2^2 =    4     // normalized, with an implied 1 as
0 1 0 0 1  ->   9/8 * 2^2 =  9/2     // the MSB of each significand,
0 1 0 1 0  ->  10/8 * 2^2 =    5     // a bias of -1 subtracted from each exponent, and
0 1 0 1 1  ->  11/8 * 2^2 = 11/2     // the binary point is to the right of the implied MSB
0 1 1 0 0  ->  12/8 * 2^2 =    6
0 1 1 0 1  ->  13/8 * 2^2 = 13/2
0 1 1 1 0  ->  14/8 * 2^2 =    7
0 1 1 1 1  ->  15/8 * 2^2 = 15/2
1 0 0 0 0  ->   8/8 * 2^3 =    8
1 0 0 0 1  ->   9/8 * 2^3 =    9
1 0 0 1 0  ->  10/8 * 2^3 =   10
1 0 0 1 1  ->  11/8 * 2^3 =   11
1 0 1 0 0  ->  12/8 * 2^3 =   12
1 0 1 0 1  ->  13/8 * 2^3 =   13
1 0 1 1 0  ->  14/8 * 2^3 =   14
1 0 1 1 1  ->  15/8 * 2^3 =   15
1 1 0 0 0  ->   8/8 * 2^4 =   16
1 1 0 0 1  ->   9/8 * 2^4 =   18
1 1 0 1 0  ->  10/8 * 2^4 =   20
1 1 0 1 1  ->  11/8 * 2^4 =   22
1 1 1 0 0  ->  12/8 * 2^4 =   24
1 1 1 0 1  ->  13/8 * 2^4 =   26
1 1 1 1 0  ->  14/8 * 2^4 =   28
1 1 1 1 1  ->  15/8 * 2^4 =   30
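Here is a quick sketch that checks the ordering claim: with the exponent field in the high order bits, decoding the 5 bit words 0 through 31 in numerical order gives a non-decreasing list of values.

    from fractions import Fraction

    # Decode a 5 bit word whose two high order bits are the exponent field and
    # whose three low order bits are the stored significand (binary point to
    # the left of the significand, denormalized bias -2, normalized bias -1).
    def decode_word(word):
        exponent_field = word >> 3
        stored_significand = word & 0b111
        if exponent_field == 0:                                    # denormalized
            return Fraction(stored_significand, 8) * 4             # 2^(0 - (-2))
        return Fraction(8 + stored_significand, 8) * 2**(exponent_field + 1)

    values = [decode_word(w) for w in range(32)]
    print(all(a <= b for a, b in zip(values, values[1:])))   # True: the order is preserved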
The fourth, and last, difference between our encodings and the ones used by "real" floating-point numbers is that we have not yet encoded any negative numbers. To encode negative numbers, we will add a sixth bit to our encoding and use it as a sign bit. Make the sign bit the most significant bit of the encoding, and let 0 denote positive numbers and 1 denote negative numbers. Our encoding now looks like this.
0 0 0 0 0 0  ->   0/8 * 2^2 =     0     // denormalized, with a bias of -2
0 0 0 0 0 1  ->   1/8 * 2^2 =   1/2     // subtracted from each exponent, and
0 0 0 0 1 0  ->   2/8 * 2^2 =     1     // the binary point is to the left of the significand
0 0 0 0 1 1  ->   3/8 * 2^2 =   3/2
0 0 0 1 0 0  ->   4/8 * 2^2 =     2
0 0 0 1 0 1  ->   5/8 * 2^2 =   5/2
0 0 0 1 1 0  ->   6/8 * 2^2 =     3
0 0 0 1 1 1  ->   7/8 * 2^2 =   7/2
0 0 1 0 0 0  ->   8/8 * 2^2 =     4     // normalized, with an implied 1 as
0 0 1 0 0 1  ->   9/8 * 2^2 =   9/2     // the MSB of each significand,
0 0 1 0 1 0  ->  10/8 * 2^2 =     5     // a bias of -1 subtracted from each exponent, and
0 0 1 0 1 1  ->  11/8 * 2^2 =  11/2     // the binary point is to the right of the implied MSB
0 0 1 1 0 0  ->  12/8 * 2^2 =     6
0 0 1 1 0 1  ->  13/8 * 2^2 =  13/2
0 0 1 1 1 0  ->  14/8 * 2^2 =     7
0 0 1 1 1 1  ->  15/8 * 2^2 =  15/2
0 1 0 0 0 0  ->   8/8 * 2^3 =     8
0 1 0 0 0 1  ->   9/8 * 2^3 =     9
0 1 0 0 1 0  ->  10/8 * 2^3 =    10
0 1 0 0 1 1  ->  11/8 * 2^3 =    11
0 1 0 1 0 0  ->  12/8 * 2^3 =    12
0 1 0 1 0 1  ->  13/8 * 2^3 =    13
0 1 0 1 1 0  ->  14/8 * 2^3 =    14
0 1 0 1 1 1  ->  15/8 * 2^3 =    15
0 1 1 0 0 0  ->   8/8 * 2^4 =    16
0 1 1 0 0 1  ->   9/8 * 2^4 =    18
0 1 1 0 1 0  ->  10/8 * 2^4 =    20
0 1 1 0 1 1  ->  11/8 * 2^4 =    22
0 1 1 1 0 0  ->  12/8 * 2^4 =    24
0 1 1 1 0 1  ->  13/8 * 2^4 =    26
0 1 1 1 1 0  ->  14/8 * 2^4 =    28
0 1 1 1 1 1  ->  15/8 * 2^4 =    30
1 0 0 0 0 0  ->   0/8 * 2^2 =    -0     // denormalized, with a bias of -2
1 0 0 0 0 1  ->   1/8 * 2^2 =  -1/2     // subtracted from each exponent, and
1 0 0 0 1 0  ->   2/8 * 2^2 =    -1     // the binary point is to the left of the significand
1 0 0 0 1 1  ->   3/8 * 2^2 =  -3/2
1 0 0 1 0 0  ->   4/8 * 2^2 =    -2
1 0 0 1 0 1  ->   5/8 * 2^2 =  -5/2
1 0 0 1 1 0  ->   6/8 * 2^2 =    -3
1 0 0 1 1 1  ->   7/8 * 2^2 =  -7/2
1 0 1 0 0 0  ->   8/8 * 2^2 =    -4     // normalized, with an implied 1 as
1 0 1 0 0 1  ->   9/8 * 2^2 =  -9/2     // the MSB of each significand,
1 0 1 0 1 0  ->  10/8 * 2^2 =    -5     // a bias of -1 subtracted from each exponent, and
1 0 1 0 1 1  ->  11/8 * 2^2 = -11/2     // the binary point is to the right of the implied MSB
1 0 1 1 0 0  ->  12/8 * 2^2 =    -6
1 0 1 1 0 1  ->  13/8 * 2^2 = -13/2
1 0 1 1 1 0  ->  14/8 * 2^2 =    -7
1 0 1 1 1 1  ->  15/8 * 2^2 = -15/2
1 1 0 0 0 0  ->   8/8 * 2^3 =    -8
1 1 0 0 0 1  ->   9/8 * 2^3 =    -9
1 1 0 0 1 0  ->  10/8 * 2^3 =   -10
1 1 0 0 1 1  ->  11/8 * 2^3 =   -11
1 1 0 1 0 0  ->  12/8 * 2^3 =   -12
1 1 0 1 0 1  ->  13/8 * 2^3 =   -13
1 1 0 1 1 0  ->  14/8 * 2^3 =   -14
1 1 0 1 1 1  ->  15/8 * 2^3 =   -15
1 1 1 0 0 0  ->   8/8 * 2^4 =   -16
1 1 1 0 0 1  ->   9/8 * 2^4 =   -18
1 1 1 0 1 0  ->  10/8 * 2^4 =   -20
1 1 1 0 1 1  ->  11/8 * 2^4 =   -22
1 1 1 1 0 0  ->  12/8 * 2^4 =   -24
1 1 1 1 0 1  ->  13/8 * 2^4 =   -26
1 1 1 1 1 0  ->  14/8 * 2^4 =   -28
1 1 1 1 1 1  ->  15/8 * 2^4 =   -30
Notice that the positive numbers are in the correct binary order, but the negative numbers are not. But it is still possible, with a little bit of effort, to use integer comparison to check the order of these "floating-point numbers".
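One standard way to do this (a sketch of our own, not part of the table above) is to map each 6 bit sign-magnitude encoding to an unsigned key so that ordinary integer comparison of the keys gives the right order: set an extra high bit for positive words, and flip all the bits of negative words so that larger magnitudes become smaller keys.

    SIGN_BIT = 0b100000          # bit 5 of the 6 bit encoding
    MASK     = 0b111111

    # key(a) < key(b) exactly when the value encoded by a is less than the
    # value encoded by b (this ordering treats -0 as less than +0).
    def comparison_key(word):
        if word & SIGN_BIT:                  # negative: reverse the order
            return (~word) & MASK            # 111111 (most negative) -> 000000
        return word | (SIGN_BIT << 1)        # positive: push above all negatives

    # Example from the table: -9 is encoded as 110001 and 7 as 001110.
    print(comparison_key(0b110001) < comparison_key(0b001110))   # True, since -9 < 7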
Given a binary word and information about which bits are the exponent, which bits are the significand, and what the bias is, you should now be able to interpret the binary word as a floating-point number.
For example, here is a 32 bit word.
0 01001010 11001000000000000000000 (0x25640000)
This word is in the IEEE single precision format which has the most significant bit as a sign bit, the next eight bits as the exponent, and the remaining 23 bits as the significand. The bias is 126 for denormalized numbers and 127 for the normalized numbers.
The exponent field is 74, so the exponent minus the bias is 74 - 127 = -53. The significand, with its implicit MSB, is 1.11001, which is 1 + 1/2 + 1/4 + 1/32 = 57/32 = 1.78125. So the number represented by this encoding is
1.78125 * 2^(-53) = 1.9775848e-16 = 0.00000000000000019775848.
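If you want to check a decoding like this mechanically, here is a Python sketch: it applies the formula above to the bit pattern, and it also reinterprets the same 32 bits directly as an IEEE single-precision float with the standard struct module.

    import struct

    bits = 0x25640000

    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF          # 0b01001010 = 74
    fraction = bits & 0x7FFFFF              # the 23 stored significand bits

    # Normalized single precision value: (-1)^sign * 1.fraction * 2^(exponent - 127)
    value = (-1)**sign * (1 + fraction / 2**23) * 2.0**(exponent - 127)
    print(value)                                             # about 1.9775848e-16

    # Reinterpret the same 32 bits as an IEEE single-precision float.
    print(struct.unpack('>f', bits.to_bytes(4, 'big'))[0])   # the same value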
You can also check this answer with any online IEEE-754 floating-point converter.
To give you a better sense of what makes floating-point number systems interesting, consider the following table. It shows two different 6-bit floating-point number systems. The system on the left uses 3 bits for the significand, 3 bits for the exponent, and a bias of 3. The system on the right uses 4 bits for the significand, 2 bits for the exponent, and a bias of -1.
The two systems have nearly the same range. But they distribute the fractional numbers and the integers through the range in very different ways. The system on the right has every integer from 0 to 31, but the system on the left skips every other integer after 16. The system on the left has more precise small numbers but there are no more fractional numbers after 7.5. The system on the right has imprecise small numbers, but it has some fractions all the way up to 15.5. Which of these systems is "better"? Given that you have to use 6 bits (or 32 bits, or 64 bits) which floating-point system would be more useful for scientific calculations? These are the kinds of questions computer scientists had to answer when they designed the IEEE floating-point number system.
0 0 0, 0 0 0  ->   0 * 2^(-3)   =     0     |   0 0 0 0, 0 0  ->   0/8 * 2^1  =     0
0 0 1, 0 0 0  ->   1/4 * 2^(-3) =  1/32     |   0 0 0 1, 0 0  ->   1/8 * 2^1  =   1/4
0 1 0, 0 0 0  ->   2/4 * 2^(-3) =  2/32     |   0 0 1 0, 0 0  ->   2/8 * 2^1  =   1/2
0 1 1, 0 0 0  ->   3/4 * 2^(-3) =  3/32     |   0 0 1 1, 0 0  ->   3/8 * 2^1  =   3/4
1 0 0, 0 0 0  ->   4/4 * 2^(-3) =  4/32     |   0 1 0 0, 0 0  ->   4/8 * 2^1  =     1
1 0 1, 0 0 0  ->   5/4 * 2^(-3) =  5/32     |   0 1 0 1, 0 0  ->   5/8 * 2^1  =   5/4
1 1 0, 0 0 0  ->   6/4 * 2^(-3) =  6/32     |   0 1 1 0, 0 0  ->   6/8 * 2^1  =   3/2
1 1 1, 0 0 0  ->   7/4 * 2^(-3) =  7/32     |   0 1 1 1, 0 0  ->   7/8 * 2^1  =   7/4
0 0 0, 0 0 1  ->   8/8 * 2^(-2) =  8/32     |   1 0 0 0, 0 0  ->   8/8 * 2^1  =     2
0 0 1, 0 0 1  ->   9/8 * 2^(-2) =  9/32     |   1 0 0 1, 0 0  ->   9/8 * 2^1  =   9/4
0 1 0, 0 0 1  ->  10/8 * 2^(-2) = 10/32     |   1 0 1 0, 0 0  ->  10/8 * 2^1  =   5/2
0 1 1, 0 0 1  ->  11/8 * 2^(-2) = 11/32     |   1 0 1 1, 0 0  ->  11/8 * 2^1  =  11/4
1 0 0, 0 0 1  ->  12/8 * 2^(-2) = 12/32     |   1 1 0 0, 0 0  ->  12/8 * 2^1  =     3
1 0 1, 0 0 1  ->  13/8 * 2^(-2) = 13/32     |   1 1 0 1, 0 0  ->  13/8 * 2^1  =  13/4
1 1 0, 0 0 1  ->  14/8 * 2^(-2) = 14/32     |   1 1 1 0, 0 0  ->  14/8 * 2^1  =   7/2
1 1 1, 0 0 1  ->  15/8 * 2^(-2) = 15/32     |   1 1 1 1, 0 0  ->  15/8 * 2^1  =  15/4
0 0 0, 0 1 0  ->   8/8 * 2^(-1) =   1/2     |   0 0 0 0, 0 1  ->  16/16 * 2^2 =     4
0 0 1, 0 1 0  ->   9/8 * 2^(-1) =  9/16     |   0 0 0 1, 0 1  ->  17/16 * 2^2 =  17/4
0 1 0, 0 1 0  ->  10/8 * 2^(-1) = 10/16     |   0 0 1 0, 0 1  ->  18/16 * 2^2 =  18/4
0 1 1, 0 1 0  ->  11/8 * 2^(-1) = 11/16     |   0 0 1 1, 0 1  ->  19/16 * 2^2 =  19/4
1 0 0, 0 1 0  ->  12/8 * 2^(-1) = 12/16     |   0 1 0 0, 0 1  ->  20/16 * 2^2 =     5
1 0 1, 0 1 0  ->  13/8 * 2^(-1) = 13/16     |   0 1 0 1, 0 1  ->  21/16 * 2^2 =  21/4
1 1 0, 0 1 0  ->  14/8 * 2^(-1) = 14/16     |   0 1 1 0, 0 1  ->  22/16 * 2^2 =  22/4
1 1 1, 0 1 0  ->  15/8 * 2^(-1) = 15/16     |   0 1 1 1, 0 1  ->  23/16 * 2^2 =  23/4
0 0 0, 0 1 1  ->   8/8 * 2^0    =     1     |   1 0 0 0, 0 1  ->  24/16 * 2^2 =     6
0 0 1, 0 1 1  ->   9/8 * 2^0    =   9/8     |   1 0 0 1, 0 1  ->  25/16 * 2^2 =  25/4
0 1 0, 0 1 1  ->  10/8 * 2^0    =  10/8     |   1 0 1 0, 0 1  ->  26/16 * 2^2 =  26/4
0 1 1, 0 1 1  ->  11/8 * 2^0    =  11/8     |   1 0 1 1, 0 1  ->  27/16 * 2^2 =  27/4
1 0 0, 0 1 1  ->  12/8 * 2^0    =  12/8     |   1 1 0 0, 0 1  ->  28/16 * 2^2 =     7
1 0 1, 0 1 1  ->  13/8 * 2^0    =  13/8     |   1 1 0 1, 0 1  ->  29/16 * 2^2 =  29/4
1 1 0, 0 1 1  ->  14/8 * 2^0    =  14/8     |   1 1 1 0, 0 1  ->  30/16 * 2^2 =  30/4
1 1 1, 0 1 1  ->  15/8 * 2^0    =  15/8     |   1 1 1 1, 0 1  ->  31/16 * 2^2 =  31/4
0 0 0, 1 0 0  ->   8/8 * 2^1    =     2     |   0 0 0 0, 1 0  ->  16/16 * 2^3 =     8
0 0 1, 1 0 0  ->   9/8 * 2^1    =   9/4     |   0 0 0 1, 1 0  ->  17/16 * 2^3 =  17/2
0 1 0, 1 0 0  ->  10/8 * 2^1    =  10/4     |   0 0 1 0, 1 0  ->  18/16 * 2^3 =     9
0 1 1, 1 0 0  ->  11/8 * 2^1    =  11/4     |   0 0 1 1, 1 0  ->  19/16 * 2^3 =  19/2
1 0 0, 1 0 0  ->  12/8 * 2^1    =     3     |   0 1 0 0, 1 0  ->  20/16 * 2^3 =    10
1 0 1, 1 0 0  ->  13/8 * 2^1    =  13/4     |   0 1 0 1, 1 0  ->  21/16 * 2^3 =  21/2
1 1 0, 1 0 0  ->  14/8 * 2^1    =  14/4     |   0 1 1 0, 1 0  ->  22/16 * 2^3 =    11
1 1 1, 1 0 0  ->  15/8 * 2^1    =  15/4     |   0 1 1 1, 1 0  ->  23/16 * 2^3 =  23/2
0 0 0, 1 0 1  ->   8/8 * 2^2    =     4     |   1 0 0 0, 1 0  ->  24/16 * 2^3 =    12
0 0 1, 1 0 1  ->   9/8 * 2^2    =   9/2     |   1 0 0 1, 1 0  ->  25/16 * 2^3 =  25/2
0 1 0, 1 0 1  ->  10/8 * 2^2    =     5     |   1 0 1 0, 1 0  ->  26/16 * 2^3 =    13
0 1 1, 1 0 1  ->  11/8 * 2^2    =  11/2     |   1 0 1 1, 1 0  ->  27/16 * 2^3 =  27/2
1 0 0, 1 0 1  ->  12/8 * 2^2    =     6     |   1 1 0 0, 1 0  ->  28/16 * 2^3 =    14
1 0 1, 1 0 1  ->  13/8 * 2^2    =  13/2     |   1 1 0 1, 1 0  ->  29/16 * 2^3 =  29/2
1 1 0, 1 0 1  ->  14/8 * 2^2    =     7     |   1 1 1 0, 1 0  ->  30/16 * 2^3 =    15
1 1 1, 1 0 1  ->  15/8 * 2^2    =  15/2     |   1 1 1 1, 1 0  ->  31/16 * 2^3 =  31/2
0 0 0, 1 1 0  ->   8/8 * 2^3    =     8     |   0 0 0 0, 1 1  ->  16/16 * 2^4 =    16
0 0 1, 1 1 0  ->   9/8 * 2^3    =     9     |   0 0 0 1, 1 1  ->  17/16 * 2^4 =    17
0 1 0, 1 1 0  ->  10/8 * 2^3    =    10     |   0 0 1 0, 1 1  ->  18/16 * 2^4 =    18
0 1 1, 1 1 0  ->  11/8 * 2^3    =    11     |   0 0 1 1, 1 1  ->  19/16 * 2^4 =    19
1 0 0, 1 1 0  ->  12/8 * 2^3    =    12     |   0 1 0 0, 1 1  ->  20/16 * 2^4 =    20
1 0 1, 1 1 0  ->  13/8 * 2^3    =    13     |   0 1 0 1, 1 1  ->  21/16 * 2^4 =    21
1 1 0, 1 1 0  ->  14/8 * 2^3    =    14     |   0 1 1 0, 1 1  ->  22/16 * 2^4 =    22
1 1 1, 1 1 0  ->  15/8 * 2^3    =    15     |   0 1 1 1, 1 1  ->  23/16 * 2^4 =    23
0 0 0, 1 1 1  ->   8/8 * 2^4    =    16     |   1 0 0 0, 1 1  ->  24/16 * 2^4 =    24
0 0 1, 1 1 1  ->   9/8 * 2^4    =    18     |   1 0 0 1, 1 1  ->  25/16 * 2^4 =    25
0 1 0, 1 1 1  ->  10/8 * 2^4    =    20     |   1 0 1 0, 1 1  ->  26/16 * 2^4 =    26
0 1 1, 1 1 1  ->  11/8 * 2^4    =    22     |   1 0 1 1, 1 1  ->  27/16 * 2^4 =    27
1 0 0, 1 1 1  ->  12/8 * 2^4    =    24     |   1 1 0 0, 1 1  ->  28/16 * 2^4 =    28
1 0 1, 1 1 1  ->  13/8 * 2^4    =    26     |   1 1 0 1, 1 1  ->  29/16 * 2^4 =    29
1 1 0, 1 1 1  ->  14/8 * 2^4    =    28     |   1 1 1 0, 1 1  ->  30/16 * 2^4 =    30
1 1 1, 1 1 1  ->  15/8 * 2^4    =    30     |   1 1 1 1, 1 1  ->  31/16 * 2^4 =    31
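As a cross-check, here is a sketch that enumerates both systems with one parameterized decoder (significand width, exponent width, and normalized bias; as throughout this page, the denormalized bias is taken to be one less than the normalized bias).

    from fractions import Fraction

    # Decode one word of a floating-point style format: sig_bits stored
    # significand bits (binary point to the left, implied leading 1 when
    # normalized), exp_bits exponent bits, and a normalized bias.
    def decode(stored_significand, exponent_field, sig_bits, norm_bias):
        one = 2**sig_bits
        if exponent_field == 0:                                    # denormalized
            return Fraction(stored_significand, one) * Fraction(2)**(1 - norm_bias)
        return Fraction(one + stored_significand, one) * Fraction(2)**(exponent_field - norm_bias)

    def all_values(sig_bits, exp_bits, norm_bias):
        return sorted({decode(s, e, sig_bits, norm_bias)
                       for e in range(2**exp_bits) for s in range(2**sig_bits)})

    print(all_values(3, 3, 3))    # the system on the left
    print(all_values(4, 2, -1))   # the system on the right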
To complete this explanation of floating-point numbers, we would need to discuss the algorithms used to add, subtract, multiply, and divide these numbers. And we would also need to discuss how to round the result of an arithmetic operation to a representable number.