Conversion of Floating Point Numbers in Matlab

In the last post on floating point numbers, I presented a brief overview of floating point numbers, introduced several Matlab functions that provide information about floats (realmin, realmax, and eps), and explored the workings of eps. In this post, I would like to introduce a function that I wrote in Matlab to convert a floating point number to its binary representation and use that function to explain the floating-point representations of ten different numbers.

The motivation for this code is that viewing the binary form of a floating point number is instructive and useful, and there is no function that performs this operation included with the core Matlab software. After briefly searching the Matlab File Exchange, I was unable to find a function that did exactly this conversion. There are, however, some interesting functions that convert the integer and fractional parts of the decimal number to binary numbers separately, but none that convert a decimal floating point number to its actual binary encoding. The function that I wrote, float2bin, does that and is displayed and described below.

function b = float2bin(f)
%This function converts a floating point number to its binary form.

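%Input error handling
%(error is used instead of disp/return so that a call such as
% b = float2bin('a') never continues with the output b unassigned)
if ~isfloat(f)
  error('float2bin:invalidInput', 'Input must be a floating point number.');
end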

%Hex characters
hex = '0123456789abcdef'; 

%Convert from float to hex
h = num2hex(f); 

%Convert to cell array of chars
hc = num2cell(h); 

%Convert to array of numbers
nums =  cellfun(@(x) find(hex == x) - 1, hc); 

%Convert to array of binary numbers
bins = dec2bin(nums, 4); 

%Reshape into horizontal vector
b = reshape(bins.', 1, numel(bins));

This code begins with an error-checking condition that verifies that the input is a floating point number. If this is not the case, the function displays an error message and exits without returning an output. Otherwise, preparation for the conversion from decimal to binary commences with the creation of a string of hexadecimal characters (0 – f), which are used to convert hex characters to binary numbers. The conversion is performed in five steps, the first of which is the conversion from decimal to hexadecimal. Because there is no native Matlab function that directly performs the conversion from floating point decimal to binary, the floating point number must be converted to a hexadecimal format that corresponds with the binary structure of the floating point number. This is accomplished with num2hex, which does most of the heavy lifting for us. Without this function, we would have to delve into the details of the floating point format to accomplish this conversion.

After using num2hex, we have a string of hex characters that reveals the floating point format, and we only need to convert the hex characters to binary. To achieve this task, we first convert the hex string into a cell array of characters with num2cell in order to perform individual operations on each character. This conversion enables the use of cellfun, a function that was described in a recent MLG post, to convert each hex character into an integer in the range of 0 to 15. Each character in hc is looked up in hex, the string of hex characters that was created at the beginning of the function; because Matlab indices start at 1, the integer value of the character is its position in hex minus 1. Next, each integer that represents a hex character is converted into binary with dec2bin. The second argument is set to 4 in order to force each binary number to have four digits. One feature of dec2bin is somewhat inconvenient for our purposes; specifically, this function arranges the converted values of nums into a vertically oriented character array with one row per value and a width of 4. This problem is remedied with a call to reshape, which rearranges bins into a horizontal character array. Note that bins must be transposed before being reshaped because reshape accesses the elements of its input array in column-major order. If any part of this process is unclear, I recommend entering the commands sequentially in the Matlab command window and examining the output after every step. I have avoided listing the results after every step in this post for the sake of brevity.
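For readers who want to follow the steps without editing the function, the commands can be run as a short script. This is only a sketch that repeats the body of float2bin for the single input 0.1; the last two lines are not part of float2bin and show one possible shortcut, assuming that hex2dec is applied to a column of single hex characters.

```matlab
% Step-by-step version of the float2bin body for one scalar double.
f   = 0.1;
hex = '0123456789abcdef';

h    = num2hex(f);                            % 1x16 char of hex digits
hc   = num2cell(h);                           % 1x16 cell, one character per cell
nums = cellfun(@(x) find(hex == x) - 1, hc);  % 1x16 double, values 0 to 15
bins = dec2bin(nums, 4);                      % 16x4 char, one row per hex digit
b    = reshape(bins.', 1, numel(bins));       % 1x64 char, read row by row

% Alternative (not the author's method): let hex2dec do the lookup.
b2 = reshape(dec2bin(hex2dec(h(:)), 4).', 1, []);
sameResult = isequal(b, b2);
```

Omitting the semicolons prints each intermediate value, which makes the effect of the transpose before reshape easy to see.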

Of course, it would be possible to convert a decimal floating point number to its binary encoding by manipulating the numbers directly and not using num2hex. This could be accomplished by converting the number to binary by successive division by powers of 2, extracting the exponent and significand, rounding to the available number of significand digits, and assigning the sign bit. This would also involve control statements to handle denormal numbers, signed zeros, infinities, and NaNs. Despite its educational value, this process would undoubtedly be slower than using num2hex and is unnecessary in Matlab.
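To give a rough idea of what the direct approach involves, here is a minimal sketch that handles only finite, nonzero, normal doubles. It is not the method used by float2bin: it leans on the two-output form of log2 to split the number into a fraction and an exponent instead of dividing by powers of 2 by hand, and the variable names are made up for illustration. Zeros, subnormals, infinities, and NaNs would each need their own branch.

```matlab
% Direct encoding of a normal double, without num2hex (sketch only).
f = 0.1;                       % assumed finite, nonzero, and normal

[m, e]   = log2(abs(f));       % abs(f) = m * 2^e with 0.5 <= m < 1
signBit  = double(f < 0);      % 1 for negative numbers, 0 otherwise
expField = (e - 1) + 1023;     % 2*m lies in [1, 2), so the exponent is e - 1; add the bias
fracInt  = (2*m - 1) * 2^52;   % the 52 stored significand bits as an integer

b = [dec2bin(signBit, 1), dec2bin(expField, 11), dec2bin(fracInt, 52)];

% Compare against the encoding obtained through num2hex.
ref     = reshape(dec2bin(hex2dec(num2hex(f).'), 4).', 1, []);
matches = strcmp(b, ref);
```

No rounding step is needed here because f is already a double; rounding would only come into play when starting from a decimal string.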

Examples of floats in binary

Now that float2bin has been explained, let’s utilize this function to examine the binary representations of ten different floating point numbers. These concrete examples will hopefully help familiarize you with the floating point format. All of the examples use double precision, which is Matlab’s default numeric data type.

>> float2bin(0.1)
ans = 0011111110111001100110011001100110011001100110011001100110011010
>> float2bin(2)
ans = 0100000000000000000000000000000000000000000000000000000000000000
>> float2bin(0)
ans = 0000000000000000000000000000000000000000000000000000000000000000
>> float2bin(-0)
ans = 1000000000000000000000000000000000000000000000000000000000000000
>> float2bin(realmin)
ans = 0000000000010000000000000000000000000000000000000000000000000000
>> float2bin(realmax)
ans = 0111111111101111111111111111111111111111111111111111111111111111
>> float2bin(eps(0))
ans = 0000000000000000000000000000000000000000000000000000000000000001
>> float2bin(Inf)
ans = 0111111111110000000000000000000000000000000000000000000000000000
>> float2bin(-Inf)
ans = 1111111111110000000000000000000000000000000000000000000000000000
>> float2bin(NaN)
ans = 0111111111111000000000000000000000000000000000000000000000000000

The first example shows the binary representation of 0.1. This frequently used example cannot be represented exactly in binary with a finite number of digits, which leads to rounding errors in calculations. Its binary expansion, 0.000110011001100…, repeats the pattern 1100 indefinitely; because doubles only have 52 stored significand bits, the pattern is cut off at the least significant bit, where the value is rounded up (the significand ends in 1010 rather than 1001). The second example shows one of the simplest binary numbers to represent: 2. The sign bit is 0, which means that the number is positive. The exponent bits (bits 2 to 12 in this format) are 10000000000, which is equal to 1024. The exponent bias is 1023, which is subtracted from this value to give an unbiased exponent of 1. All of the digits of the significand are 0, which means that the significand is 1.00…0, taking into account the implicit leading 1. Thus, this format represents 1.0 × 2^1 = 2.
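The decoding of 2 described above can be reproduced with a few lines of Matlab. This is a small sketch for normal numbers only; the bit string is assembled from its three fields rather than typed out, and the variable names are illustrative.

```matlab
% Split the 64-bit pattern of 2 into its fields and decode it.
b = ['0', '10000000000', repmat('0', 1, 52)];   % sign, exponent, significand

s        = bin2dec(b(1));       % sign bit
expField = bin2dec(b(2:12));    % biased exponent, 1024 for this pattern
frac     = bin2dec(b(13:64));   % stored significand bits as an integer

% Implicit leading 1 plus the stored fraction, scaled by the unbiased exponent.
value = (-1)^s * (1 + frac / 2^52) * 2^(expField - 1023);
```

Replacing b with the output of float2bin for any other normal double should recover that number in value.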

The third and fourth examples show how +0 and -0 are represented in binary. All of the exponent and significand bits are 0, while the sign bit is 0 for +0 and 1 for -0. The fifth and sixth examples show the smallest and largest normalized numbers, realmin and realmax, respectively. Notice that they differ in every bit except for the sign bit. A normal number can be calculated from its floating-point representation with the following formula:

N = (-1)^s × ( d0 + d1 β^-1 + … + d_(p-1) β^-(p-1) ) × β^[ ( e0 β^(E-1) + e1 β^(E-2) + … + e_(E-1) ) – B ],

where N is the normal number, s is the sign bit, d0, …, d_(p-1) are the digits of the significand, p is the number of digits of the significand, e0, …, e_(E-1) are the digits of the exponent, E is the number of digits in the exponent, β is the base (2 for binary), and B is the exponent bias. Contrary to many representations of the floating point format, I have given the least significant digits of the exponent and significand the largest subscripts. Consequently, our printed binary representations are organized as follows:

s e0 e1 … e10 d1 d2 … d52

Therefore, realmin and realmax can be computed as follows:

realmin = (-1)^0 × [ 1 ] × 2^( 1  – 1023) = 2^-1022 = 2.225 × 10^-308

realmax = (-1)^0 × [ 1 + 2^-1 + … + 2^-52 ] × 2^( 2^10 + 2^9 + … + 2^1 + 0 – 1023)
                = [ 2 – 2^-52 ] × 2^(2046 – 1023) = [ 2 – 2^-52 ] × 2^1023 = 1.798 × 10^308

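To understand how the sum of the powers of 2 from 1 = 2^0 down to 2^-52 equals 2 – 2^-52, note that it is a finite geometric series with 53 terms and a ratio of 1/2, so it equals (1 – 2^-53) / (1 – 1/2) = 2 – 2^-52. For more detail, I refer you to Wolfram MathWorld’s webpage on geometric series.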
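These identities can also be checked numerically. The following sketch relies on the fact that every quantity involved is exactly representable as a double, so the comparisons can use exact equality; the variable names are illustrative.

```matlab
% Largest significand: 1 + 2^-1 + ... + 2^-52, summed term by term.
significandMax = sum(pow2(-(0:52)));
seriesOK = isequal(significandMax, 2 - 2^-52);

% realmin and realmax rebuilt from their fields.
realminOK = isequal(1 * 2^(1 - 1023), realmin);
realmaxOK = isequal(significandMax * 2^(2046 - 1023), realmax);
```

All three flags should come out true.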

Before we explain the last four examples, notice that the exponents of realmin and realmax are 0…01 and 1…10, which have biased values of 1 and 2046 and unbiased values of -1022 and 1023, respectively. The exponent bit values of 0…0 and 1…1 are reserved for special numbers, including subnormals, infinities, and NaNs.

The seventh example is eps(0), which is the smallest subnormal number that can be represented by a double. Note that all the exponent bits are 0, indicating that this is a subnormal number; consequently, the leading digit (d0) is 0, and the exponent is fixed at 1 minus the bias, i.e., 1 – 1023 = -1022, which is the same exponent as that of realmin. Only the last digit in the significand is 1; thus, eps(0) can be computed as follows:

eps(0) = (-1)^0 × [ 2^-52 ] × 2^(-1023 + 1) = 2^-1074 = 4.940 × 10^-324
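The subnormal case can be decoded in the same way as a normal number, with two changes: the implicit leading digit is 0 instead of 1, and the exponent is fixed at 1 – 1023. Here is a short sketch for the bit pattern of eps(0), with illustrative variable names:

```matlab
% Decode the smallest subnormal: 63 zeros followed by a single 1.
b = [repmat('0', 1, 63), '1'];

expField = bin2dec(b(2:12));    % 0 marks a subnormal (or zero)
frac     = bin2dec(b(13:64));   % stored significand bits as an integer

% No implicit leading 1; the exponent is fixed at 1 - 1023 = -1022.
value = (frac / 2^52) * 2^(1 - 1023);
isEps0 = isequal(value, eps(0));
```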

The eighth and ninth examples show the binary floating-point representations of Inf and -Inf, respectively. Note that all the significand bits are 0, all the exponent bits are 1, and the only difference between Inf and -Inf is the sign bit. Finally, the last example shows one type of NaN in floating-point binary. All of the exponent bits and the first significand bit are 1 for this type of NaN. The floating-point binary representations of NaNs resulting from various calculations are slightly different. I encourage you to discover these differences by using float2bin.
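One way to start exploring is to test the two properties that separate NaNs from infinities: an exponent field of all ones, combined with a significand that is nonzero for NaN and zero for Inf. The sketch below reads the raw bits with typecast instead of float2bin so that it runs on its own; the list of test values is just an example. The exact hex pattern printed for each NaN may depend on the platform and Matlab version, so no output is shown here.

```matlab
% Inspect the exponent and significand fields of a few special values.
vals = [NaN, 0/0, Inf - Inf, Inf, -Inf];
expAllOnes  = false(size(vals));
fracNonzero = false(size(vals));

for k = 1:numel(vals)
    u = typecast(vals(k), 'uint64');                     % raw 64-bit pattern
    expField = bitand(bitshift(u, -52), uint64(2047));   % 11 exponent bits
    frac     = bitand(u, uint64(2^52 - 1));              % 52 significand bits
    expAllOnes(k)  = (expField == 2047);
    fracNonzero(k) = (frac ~= 0);
    fprintf('%s: exponent all ones = %d, significand nonzero = %d\n', ...
        num2hex(vals(k)), expAllOnes(k), fracNonzero(k));
end
```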

This concludes the second post in our series on floating point numbers. I hope that you find float2bin useful and invite you to use it freely for any purpose, similar to any of the other code that we present in our posts. In the next post, I will detail the inverse function of float2bin, which is named bin2float and converts binary floating-point representations to their decimal equivalents. I will also demonstrate how to convert between various floating-point binary representations.

 
