## Understanding Floating Point Formats

Under ordinary circumstances, you don’t have to know or care how numbers are represented within your programs. However, when you are transferring data files that contain numbers, you will have to convert them if the storage formats are not identical. If the numbers are just integers, that’s fairly easy because the only differences will be the length and the byte order: how many bytes the number takes up, and whether it is stored LSB or MSB first (least significant byte or most significant byte first). Once you know that, conversion is trivial.
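As a minimal sketch of the integer case (in Python rather than the Perl used later, purely for illustration), here is how the same four bytes read under each byte order. The helper name `read_int` and the byte values are made up for this example.

```python
def read_int(raw: bytes, msb_first: bool, signed: bool = False) -> int:
    """Interpret raw bytes as an integer; the length is simply len(raw)."""
    order = "big" if msb_first else "little"
    return int.from_bytes(raw, byteorder=order, signed=signed)


# The same four bytes, read both ways (illustrative values only).
raw = bytes([0x00, 0x00, 0x01, 0x02])
print(read_int(raw, msb_first=True))    # 258
print(read_int(raw, msb_first=False))   # 33619968
```

Length and byte order are the only two things the caller has to supply, which is why the integer case is the easy one.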

Floating point numbers are a whole other game. For example, in December of 1983, I had to convert some Tandy Basic programs and data files to Xenix MBASIC. The Basic programs themselves were fairly challenging, but the data files were even more so. Tandy stored floating point numbers in what they called "XS128 notation" (Excess 128 is what they really meant) and MBASIC used packed BCD. At the time, I had never given a single thought to how floating point numbers are stored. As you surely realize, this was long before you could ask Google to find you something like MAD 3401 IEEE Floating-Point Notes, and the availability of computer-oriented books was not anything like it is today. I was on my own, with only "od -cx", my wits, and pure stubbornness to go on. There was an explanation in the manuals, but it was typical geek-babble and it made my head hurt. It took me several hours of painful work to understand what I needed to do, and a few hours more to write programs to do it, but the project got done. I haven’t had to do anything like that since then, and you may never have had to at all, but that doesn’t mean that neither of us ever will. So rather than you getting a headache from trying to puzzle it out (because there’s still a lot of techno-babble out there), I’ll get you started.

The first thing you need to know is that your machine may give different results than mine. It probably won’t unless you are using something odd, but if it does, don’t panic: the theory is still the same; you just have a slightly different implementation. Here’s a Perl program that is going to show us what’s going on (you do not need to understand this script):

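    #!/usr/bin/perl
    # Print each value, its 32-bit single precision encoding in hex,
    # and then the sign, exponent and mantissa bits.
    use strict;
    use warnings;

    showbits(0);
    for (my $n = 1; $n < 16384; $n *= 2) {
        showbits($n);
    }
    showbits("5.75");
    showbits("-.1");

    sub showbits {
        my $x = shift;
        # "f>" packs a big-endian single precision float, so the hex reads
        # most significant byte first on any machine (needs Perl 5.10+).
        my $string = pack("f>", $x);
        print "$x\t";
        my $y = uc(unpack("H*", $string));
        print "$y\t";
        # Turn each of the four bytes into an 8-digit binary string.
        my @hx;
        for (my $z = 0; $z < 8; $z += 2) {
            $hx[$z] = sprintf("%.8b ", hex(substr($y, $z, 2)));
        }
        print substr($hx[0], 0, 1), " ";    # sign bit
        print substr($hx[0], 1, 7);         # exponent, high 7 bits
        print substr($hx[2], 0, 1), " ";    # exponent, low bit
        print substr($hx[2], 1, 7);         # mantissa, first 7 bits
        print substr($hx[4], 0, 8);         # mantissa, next 8 bits
        print substr($hx[6], 0, 8);         # mantissa, last 8 bits
        print "\n";
    }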

We’re looking at single precision floating point numbers here. Double precision uses the same scheme, just more bits. Here’s what the output looks like:

    0       00000000    0 00000000 00000000000000000000000
    1       3F800000    0 01111111 00000000000000000000000
    2       40000000    0 10000000 00000000000000000000000
    4       40800000    0 10000001 00000000000000000000000
    8       41000000    0 10000010 00000000000000000000000
    16      41800000    0 10000011 00000000000000000000000
    32      42000000    0 10000100 00000000000000000000000
    64      42800000    0 10000101 00000000000000000000000
    128     43000000    0 10000110 00000000000000000000000
    256     43800000    0 10000111 00000000000000000000000
    512     44000000    0 10001000 00000000000000000000000
    1024    44800000    0 10001001 00000000000000000000000
    2048    45000000    0 10001010 00000000000000000000000
    4096    45800000    0 10001011 00000000000000000000000
    8192    46000000    0 10001100 00000000000000000000000
    5.75    40B80000    0 10000001 01110000000000000000000
    -.1     BDCCCCCD    1 01111011 10011001100110011001101

The first column is the number itself, and the second is what the stored format looks like in hex. After that come the actual bits; I’ve separated them in this odd way for a very good reason (which will become clear later). The value "5.75" is stored as "01000000101110000000000000000000" or "40B80000" (hex).
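If you would like to check the hex and bit strings yourself, here is a small sketch in Python (not the Perl used above). The helper names `stored_hex` and `hex_to_bits` are made up for this example, and `">f"` asks for big-endian single precision so the bytes come out most significant first.

```python
import struct


def stored_hex(value: float) -> str:
    """Hex of the four bytes used to store value in single precision."""
    return struct.pack(">f", value).hex().upper()


def hex_to_bits(hex_word: str) -> str:
    """Expand an 8-digit hex word into its 32 binary digits."""
    return format(int(hex_word, 16), "032b")


word = stored_hex(5.75)
print(word)                # 40B80000
print(hex_to_bits(word))   # 01000000101110000000000000000000
```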

You might easily guess that the first bit is the sign bit. I think that’s what I first grokked back in 1983 too. The next 8 bits are used for the exponent, and the last 23 are the value (the mantissa). As you will no doubt notice, the value bits from 0 to 8192 are all empty, so I must be crazy and there’s no point in reading this trash any further.

Well, actually there is. There’s a hidden bit there that isn’t stored but is always assumed. If you are really compulsive and counted the bits, you saw that only 23 bits are there. The hidden bit makes it 24 bits (3 bytes) and is always 1. So, if we add the hidden bit, the bits would look like:

    0       0 00000000 100000000000000000000000
    1       0 01111111 100000000000000000000000
    2       0 10000000 100000000000000000000000
    4       0 10000001 100000000000000000000000
    8       0 10000010 100000000000000000000000
    16      0 10000011 100000000000000000000000
    32      0 10000100 100000000000000000000000
    64      0 10000101 100000000000000000000000
    128     0 10000110 100000000000000000000000
    256     0 10000111 100000000000000000000000
    512     0 10001000 100000000000000000000000
    1024    0 10001001 100000000000000000000000
    2048    0 10001010 100000000000000000000000
    4096    0 10001011 100000000000000000000000
    8192    0 10001100 100000000000000000000000
    5.75    0 10000001 101110000000000000000000
    -.1     1 01111011 110011001100110011001101

But remember, it’s what I showed above that is really there.
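The split into sign, exponent and mantissa, and the step of putting the hidden bit back, can be sketched with a few shifts and masks. This is a Python illustration rather than a translation of the Perl script; `split_single` is a made-up name, and the sketch assumes an ordinary non-zero value.

```python
import struct


def split_single(value: float):
    """Return (sign, exponent, stored mantissa, mantissa with hidden bit)."""
    # Pack as big-endian single precision, then reread the same four
    # bytes as one unsigned 32-bit integer so we can pick bits out of it.
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                    # first bit
    exponent = (bits >> 23) & 0xFF       # next 8 bits
    mantissa = bits & 0x7FFFFF           # last 23 bits, as really stored
    with_hidden = (1 << 23) | mantissa   # 24 bits once the hidden 1 is added
    return sign, exponent, mantissa, with_hidden


sign, exponent, mantissa, full = split_single(5.75)
print(sign, format(exponent, "08b"), format(mantissa, "023b"))
# 0 10000001 01110000000000000000000
print(sign, format(exponent, "08b"), format(full, "024b"))
# 0 10000001 101110000000000000000000
```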

One more thing: there’s an implied decimal point (strictly speaking, a binary point) after that hidden bit. To get the value of bits after the decimal point, start dividing by two: so the first bit after the (implied) decimal point is .5, the next is .25 and so on. We don’t have to worry about any of that for the powers of two, because obviously those are whole numbers and the bits will be all 0. But down at the 5.75 we see that at work:

First, looking at the exponent for 5.75, we see that it is 129. Subtracting 127 gives us 2. So 1.0111 times 2^2 becomes 101.11 (simply shift 2 places to the right to multiply by 4). So now we have 101 binary, which is 5, plus .5 plus .25 (.11) or 5.75 in total. Too quick?

Taking it in detail:

Exponent: 10000001, which is 129 (use the Javascript Bit Twiddler if you like). Subtracting 127 leaves us with 2.

Mantissa: 01110000000000000000000

Add in the implied bit and we have 101110000000000000000000, with implied decimal point that’s 1.01110000000000000000000

Multiply that by 2^2 to get 101.110000000000000000000

That is 4 + 1 + .5 + .25 or 5.75
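The same steps can be written out as a short sketch. It is in Python rather than Perl, `decode_single` is a made-up name, and it only handles ordinary values (not zero or the reserved patterns).

```python
def decode_single(hex_word: str) -> float:
    """Decode an ordinary 32-bit float from its 8-digit hex form."""
    bits = int(hex_word, 16)
    sign = bits >> 31
    exponent = ((bits >> 23) & 0xFF) - 127        # remove the 127 offset
    significand = (1 << 23) | (bits & 0x7FFFFF)   # add the implied bit
    # The implied point sits after the first of the 24 bits, so the
    # significand as a number is significand / 2**23 (1.0111 binary = 1.4375).
    value = (significand / 2**23) * 2.0**exponent
    return -value if sign else value


print(decode_single("40B80000"))   # 5.75
```

Dividing by 2**23 is just another way of placing the point after the first bit; multiplying by 2 to the exponent then moves it the rest of the way.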

Look at 2048. The exponent is 128 + 8 + 2 or 138; subtract 127 and we get 11. Use the Bit Twiddler if you don’t see that. The mantissa is all 0’s, which with the implied bit makes this 1.00000000000000000000000 times 2^11. What’s 2^11? It’s 2048, of course.
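Going the other way is a useful check: put the three fields together and see whether the expected number comes back. This is a Python sketch with a made-up helper name, `build_single`; it takes the exponent as stored (with the 127 already added).

```python
import struct


def build_single(sign: int, stored_exponent: int, mantissa: int) -> float:
    """Assemble a 32-bit float from its three stored fields."""
    bits = (sign << 31) | (stored_exponent << 23) | mantissa
    # Write the 32 bits as four big-endian bytes, then read them as a float.
    (value,) = struct.unpack(">f", struct.pack(">I", bits))
    return value


print(build_single(0, 128 + 8 + 2, 0))                    # 2048.0
print(build_single(0, 129, 0b01110000000000000000000))    # 5.75
```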

Now the -.1. This can’t actually be stored precisely, but the method is still the same. The exponent is 64 + 32 + 16 + 8 + 2 + 1 or 123. Subtract 127 and we get -4, which means the decimal point moves 4 places to the left, making our value .000110011001100110011001101. Now you understand why the exponent is stored after adding 127: it’s so we can end up with negative exponents. If we calculate out the binary, that’s .0625 + .03125 + .00390625 and on to ever smaller numbers which get us very, very close to .1 (but off slightly). The sign bit was set, so it’s -.1.
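To see exactly how far off the stored value is, here is a small Python sketch (not part of the Perl script) that uses exact fractions. Packing and unpacking with `">f"` is just a way of rounding -0.1 to single precision.

```python
import struct
from fractions import Fraction

# Round -0.1 to single precision by packing and unpacking it.
(stored,) = struct.unpack(">f", struct.pack(">f", -0.1))

exact = Fraction(stored)            # exact value of the stored bits
error = exact - Fraction(-1, 10)    # distance from a true -1/10

print(exact)          # -13421773/134217728
print(float(error))   # roughly -1.49e-09
```

The denominator is 2^27, which is what you would expect from a 24-bit value whose point has been moved 4 places to the left.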

The Tandy (and DEC VAX, by the way) "excess 128" exponent storage simply shifts the range of positive versus negative exponents; other than that, it works just like this.

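Finally, there are two reserved exponent values: all 0’s and all 1’s. An exponent of all 0’s with a mantissa of all 0’s is 0 (the hidden bit does not apply here); with any other mantissa it is a "denormal", a number too small to store the normal way. An exponent of all 1’s with a mantissa of all 0’s is infinity, which is what you get when a result is too large for the format to hold or when you divide a nonzero number by zero. An exponent of all 1’s with any other mantissa is NaN (Not A Number), which is what you get from something like 0 divided by 0.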
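In IEEE 754 single precision the reserved patterns can be recognized from the exponent field alone, with the mantissa deciding which case applies. Here is a minimal Python sketch of that test; `classify` and its labels are made up for this example.

```python
import struct


def classify(hex_word: str) -> str:
    """Name the kind of value a 32-bit pattern holds (labels are informal)."""
    bits = int(hex_word, 16)
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0:
        return "zero" if mantissa == 0 else "denormal"
    if exponent == 0xFF:
        return "infinity" if mantissa == 0 else "NaN"
    return "normal"


for word in ("00000000", "7F800000", "7FC00000", "40B80000"):
    # Also let Python decode the same bytes, as a cross-check.
    (value,) = struct.unpack(">f", bytes.fromhex(word))
    print(word, classify(word), value)
```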

That’s it. Take a look at the link at the beginning if you want to go a little deeper, but this is probably all you need to get started.
